Spark Master and Workers not Connecting via Localhost Addresses - apache-spark

After install of the Spark package at Linux (SuSE SLES 12) I see the following connectivity error ("failed to connect"), which beside the Spark slave process also impacts the "pyspark" examples, rejecting connections. Any hint how to activate the port 7077 connectivity via localhost addresses is welcome. Part of the problem might be the default Linux firewall settings.
Firewall Commands to Open Localhost addresses:
sudo iptables -A INPUT -s 127.0.0.1 -d 127.0.0.1 -j ACCEPT
sudo iptables -A INPUT -s 127.0.0.1 -d zbra2016 -j ACCEPT
Starting the Spark Master - commands:
export SPARK_LOCAL_IP=zbra2016
./sbin/stop-master.sh
./sbin/start-master.sh
16/04/19 10:12:29 INFO Master: Registered signal handlers for [TERM, HUP, INT]
16/04/19 10:12:29 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/04/19 10:12:29 INFO SecurityManager: Changing view acls to: linux1
16/04/19 10:12:29 INFO SecurityManager: Changing modify acls to: linux1
16/04/19 10:12:29 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(linux1); users with modify permissions: Set(linux1)
16/04/19 10:12:30 INFO Utils: Successfully started service 'sparkMaster' on port 7077.
16/04/19 10:12:30 INFO Master: Starting Spark master at spark://zbra2016:7077
16/04/19 10:12:30 INFO Master: Running Spark version 1.6.1
16/04/19 10:12:30 WARN Utils: Service 'MasterUI' could not bind on port 8080. Attempting port 8081.
16/04/19 10:12:30 INFO Utils: Successfully started service 'MasterUI' on port 8081.
16/04/19 10:12:30 INFO MasterWebUI: Started MasterWebUI at http://localhost:8081
16/04/19 10:12:30 INFO Utils: Successfully started service on port 6066.
16/04/19 10:12:30 INFO StandaloneRestServer: Started REST server for submitting applications on port 6066
16/04/19 10:12:31 INFO Master: I have been elected leader! New state: ALIVE
Starting the Spark Worker - commands:
./sbin/stop-slave.sh
./sbin/start-slave.sh spark://zbra2016:7077
Logfile displays a "Failed to Connect Error Message":
/data/spark/spark/logs/spark-linux1-org.apache.spark.deploy.worker.Worker-1-zbra2016.out
16/04/19 10:15:46 INFO Worker: Retrying connection to master (attempt # 1)
16/04/19 10:15:46 INFO Worker: Connecting to master zbra2016:7077...
16/04/19 10:15:47 WARN Worker: Failed to connect to master zbra2016:7077
java.io.IOException: Failed to connect to zbra2016/127.0.0.1:7077
Testing connectivity of alias: zbra2016 = localhost
linux1#zbra2016:/data/spark/spark> ping zbra2016
PING localhost (127.0.0.1) 56(84) bytes of data.
64 bytes from localhost (127.0.0.1): icmp_seq=1 ttl=64 time=0.022 ms

We just found a solution for it in the setup of the Linux iptables firewall. I used the following command to open localhost traffic:
iptables -I INPUT 1 -p all -s localhost -d localhost -j ACCEPT
Now the worker process is able to connect to the master through the localhost ports.

You may be able to change the settings allowing port 7077 through your firewall.
Try:
sudo ufw allow 7077

Related

How to solve "no org.apache.spark.deploy.worker.Worker to stop" issue?

I a using a spark standalone in Google Cloud, composed of 1 master and 4 worker nodes. When I start the cluster. I can see the master and worker running. But when I try to stop-all, I get the following issue. Maybe this the reason I cannot run spark-submit. How to solve this issue. The following are the terminal screen.
sparkuser#master:/opt/spark/logs$ jps
1867 Jps
sparkuser#master:/opt/spark/logs$ start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /opt/spark/logs/spark-sparkuser-org.apache.spark.deploy.master.Master-1-master.out
worker4: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-sparkuser-org.apache.spark.deploy.worker.Worker-1-worker4.out
worker1: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-sparkuser-org.apache.spark.deploy.worker.Worker-1-worker1.out
worker2: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-sparkuser-org.apache.spark.deploy.worker.Worker-1-worker2.out
worker3: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-sparkuser-org.apache.spark.deploy.worker.Worker-1-worker3.out
sparkuser#master:/opt/spark/logs$ jps -lm
1946 sun.tools.jps.Jps -lm
1886 org.apache.spark.deploy.master.Master --host master --port 7077 --webui-port 8080
sparkuser#master:/opt/spark/logs$ cat spark-sparkuser-org.apache.spark.deploy.master.Master-1-master.out
Spark Command: /usr/lib/jvm/jdk1.8.0_202/bin/java -cp /opt/spark/conf/:/opt/spark/jars/* -Xmx1g org.apache.spark.deploy.master.Master --host master --port 7077 --webui-port 8080
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
22/10/13 04:28:23 INFO Master: Started daemon with process name: 1886#master
22/10/13 04:28:23 INFO SignalUtils: Registering signal handler for TERM
22/10/13 04:28:23 INFO SignalUtils: Registering signal handler for HUP
22/10/13 04:28:23 INFO SignalUtils: Registering signal handler for INT
22/10/13 04:28:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/10/13 04:28:24 INFO SecurityManager: Changing view acls to: sparkuser
22/10/13 04:28:24 INFO SecurityManager: Changing modify acls to: sparkuser
22/10/13 04:28:24 INFO SecurityManager: Changing view acls groups to:
22/10/13 04:28:24 INFO SecurityManager: Changing modify acls groups to:
22/10/13 04:28:24 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(sparkuser); groups with view permissions: Set(); users with modify permissions: Set(sparkuser); groups with modify permissions: Set()
22/10/13 04:28:24 INFO Utils: Successfully started service 'sparkMaster' on port 7077.
22/10/13 04:28:24 INFO Master: Starting Spark master at spark://master:7077
22/10/13 04:28:24 INFO Master: Running Spark version 3.2.2
22/10/13 04:28:25 INFO Utils: Successfully started service 'MasterUI' on port 8080.
22/10/13 04:28:25 INFO MasterWebUI: Bound MasterWebUI to 127.0.0.1, and started at http://localhost:8080
22/10/13 04:28:25 INFO Master: I have been elected leader! New state: ALIVE
sparkuser#master:/opt/spark/logs$ stop-all.sh
worker2: no org.apache.spark.deploy.worker.Worker to stop
worker4: no org.apache.spark.deploy.worker.Worker to stop
worker1: no org.apache.spark.deploy.worker.Worker to stop
worker3: no org.apache.spark.deploy.worker.Worker to stop
stopping org.apache.spark.deploy.master.Master
sparkuser#master:/opt/spark/logs$

How to fix the "ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[main,5,main]" in Google cloud VMs

I am trying to create a cluster of VM instances in Google cloud. There are 4 worker nodes and 1 master node.
Things that I have configured:
Created "sparkuser" and given sudo privileges
Installed same version of Java JDK and JRE in all machines and configured the path.
Installed same version of Scala and sparks.
Hosts file and host name added, able to ssh between each machines.
Configured the "spark-env.sh" and "slaves" file in spark on each machines
However, when I try to run this bash command "start-master.sh" it starts all the VM's spark in cluster. But with the jps command I cannot see any master and workers, on checking the file in: /spark/log
The log file contains the error and I tried to solve it with various ways found in the developers' community. Unfortunately, I am still not able to solve the issue:
I am adding the log file here:
sparkuser#master:~$ start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /opt/spark/logs/spark-sparkuser-org.apache.spark.deploy.master.Master-1-master.out
worker4: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-sparkuser-org.apache.spark.deploy.worker.Worker-1-worker4.out
worker3: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-sparkuser-org.apache.spark.deploy.worker.Worker-1-worker3.out
worker2: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-sparkuser-org.apache.spark.deploy.worker.Worker-1-worker2.out
worker1: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-sparkuser-org.apache.spark.deploy.worker.Worker-1-worker1.out
sparkuser#master:~$ jps
3280 Jps
sparkuser#master:~$ cat /opt/spark/logs/spark-sparkuser-org.apache.spark.deploy.master.Master-1-master.out.6
cat: /opt/spark/logs/spark-sparkuser-org.apache.spark.deploy.master.Master-1-master.out.6: No such file or directory
sparkuser#master:~$ cat /opt/spark/logs/spark-sparkuser-org.apache.spark.deploy.master.Master-1-master.out.5
Spark Command: /usr/lib/jvm/java-11-openjdk-amd64/bin/java -cp /opt/spark/conf/:/opt/spark/jars/* -Xmx1g org.apache.spark.deploy.master.Master --host 35.216.27.9 --port 7100 --webui-port 8080
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
22/09/30 07:09:21 INFO Master: Started daemon with process name: 3913#master
22/09/30 07:09:21 INFO SignalUtils: Registering signal handler for TERM
22/09/30 07:09:21 INFO SignalUtils: Registering signal handler for HUP
22/09/30 07:09:21 INFO SignalUtils: Registering signal handler for INT
22/09/30 07:09:22 WARN Utils: Your hostname, master resolves to a loopback address: 127.0.0.1; using 10.178.0.3 instead (on interface ens4)
22/09/30 07:09:22 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.2.2.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
22/09/30 07:09:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/09/30 07:09:22 INFO SecurityManager: Changing view acls to: sparkuser
22/09/30 07:09:22 INFO SecurityManager: Changing modify acls to: sparkuser
22/09/30 07:09:22 INFO SecurityManager: Changing view acls groups to:
22/09/30 07:09:22 INFO SecurityManager: Changing modify acls groups to:
22/09/30 07:09:22 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(sparkuser); groups with view permissions: Set(); users with modify permissions: Set(sparkuser); groups with modify permissions: Set()
22/09/30 07:09:23 WARN Utils: Service 'sparkMaster' could not bind on port 7100. Attempting port 7101.
22/09/30 07:09:23 WARN Utils: Service 'sparkMaster' could not bind on port 7101. Attempting port 7102.
22/09/30 07:09:23 WARN Utils: Service 'sparkMaster' could not bind on port 7102. Attempting port 7103.
22/09/30 07:09:23 WARN Utils: Service 'sparkMaster' could not bind on port 7103. Attempting port 7104.
22/09/30 07:09:23 WARN Utils: Service 'sparkMaster' could not bind on port 7104. Attempting port 7105.
22/09/30 07:09:23 WARN Utils: Service 'sparkMaster' could not bind on port 7105. Attempting port 7106.
22/09/30 07:09:23 WARN Utils: Service 'sparkMaster' could not bind on port 7106. Attempting port 7107.
22/09/30 07:09:23 WARN Utils: Service 'sparkMaster' could not bind on port 7107. Attempting port 7108.
22/09/30 07:09:23 WARN Utils: Service 'sparkMaster' could not bind on port 7108. Attempting port 7109.
22/09/30 07:09:23 WARN Utils: Service 'sparkMaster' could not bind on port 7109. Attempting port 7110.
22/09/30 07:09:23 WARN Utils: Service 'sparkMaster' could not bind on port 7110. Attempting port 7111.
22/09/30 07:09:23 WARN Utils: Service 'sparkMaster' could not bind on port 7111. Attempting port 7112.
22/09/30 07:09:23 WARN Utils: Service 'sparkMaster' could not bind on port 7112. Attempting port 7113.
22/09/30 07:09:23 WARN Utils: Service 'sparkMaster' could not bind on port 7113. Attempting port 7114.
22/09/30 07:09:23 WARN Utils: Service 'sparkMaster' could not bind on port 7114. Attempting port 7115.
22/09/30 07:09:23 WARN Utils: Service 'sparkMaster' could not bind on port 7115. Attempting port 7116.
22/09/30 07:09:23 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[main,5,main]
java.net.BindException: Cannot assign requested address: Service 'sparkMaster' failed after 16 retries (starting from 7100)! Consider explicitly setting the appropriate port for the service 'sparkMaster' (for example spark.ui.port for SparkUI) to an available port or increasing spark.port.maxRetries.
at java.base/sun.nio.ch.Net.bind0(Native Method)
at java.base/sun.nio.ch.Net.bind(Net.java:459)
at java.base/sun.nio.ch.Net.bind(Net.java:448)
at java.base/sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:227)
at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:134)
at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:562)
at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1334)
at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:506)
at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:491)
at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:973)
at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:260)
at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:356)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:829)
22/09/30 07:09:23 INFO ShutdownHookManager: Shutdown hook called
On spark/conf/spark-env.sh file add these following:
export SPARK_LOCAL_IP="127.0.0.1"
export SPARK_MASTER_WEBUI_PORT=8080
export SPARK_WOKER_DIR=/opt/spark/conf/slaves "user-case based path"
export SPARK_LOG_DIR=/opt/spark/logs
Along with that please ensure that you are able to SSH between all machines.
If you run scp among the machines and it runs without any error then the cluster will start. If SSH is working, but SCP is not working then remove the pub_keys and start over the key exchange process.
I hope this works.
It worked for me.

Spark worker does not resolve host on ECS but works fine with IP address

I'm trying to run Spark 3.0 on AWS ECS on EC2. I have a spark-worker service and a spark-master service. When try I run the worker with the hostname of the master (as exposed through ECS service discovery), it fails to resolve. When I put the hard-coded IP address/port, it works.
Here are some commands I ran inside the worker docker container after I ssh'ed into the EC2 backing the ECS:
# as can be seen below, the master host is reachable from the worker Docker container
root#b87fad6a3ffa:/usr/spark-3.0.0# ping spark_master.mynamespace
PING spark_master.mynamespace (172.21.60.11) 56(84) bytes of data.
64 bytes from ip-172-21-60-11.eu-west-1.compute.internal (172.21.60.11): icmp_seq=1 ttl=254 time=0.370 ms
# the following works just fine -- starting the worker successfully and connecting to the master:
root#b87fad6a3ffa:/usr/spark-3.0.0# /bin/sh -c "bin/spark-class org.apache.spark.deploy.worker.Worker spark://172.21.60.11:7077"
# !!! this is the fail
root#b87fad6a3ffa:/usr/spark-3.0.0# /bin/sh -c "bin/spark-class org.apache.spark.deploy.worker.Worker spark://spark_master.mynamespace:7077"
20/07/01 21:03:41 INFO worker.Worker: Started daemon with process name: 422#b87fad6a3ffa
20/07/01 21:03:41 INFO util.SignalUtils: Registered signal handler for TERM
20/07/01 21:03:41 INFO util.SignalUtils: Registered signal handler for HUP
20/07/01 21:03:41 INFO util.SignalUtils: Registered signal handler for INT
20/07/01 21:03:41 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/07/01 21:03:42 INFO spark.SecurityManager: Changing view acls to: root
20/07/01 21:03:42 INFO spark.SecurityManager: Changing modify acls to: root
20/07/01 21:03:42 INFO spark.SecurityManager: Changing view acls groups to:
20/07/01 21:03:42 INFO spark.SecurityManager: Changing modify acls groups to:
20/07/01 21:03:42 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
20/07/01 21:03:42 INFO util.Utils: Successfully started service 'sparkWorker' on port 39915.
20/07/01 21:03:42 ERROR util.SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[main,5,main]
org.apache.spark.SparkException: Invalid master URL: spark://spark_master.mynamespace:7077
at org.apache.spark.util.Utils$.extractHostPortFromSparkUrl(Utils.scala:2397)
at org.apache.spark.rpc.RpcAddress$.fromSparkURL(RpcAddress.scala:47)
at org.apache.spark.deploy.worker.Worker$.$anonfun$startRpcEnvAndEndpoint$3(Worker.scala:859)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
at scala.collection.TraversableLike.map(TraversableLike.scala:238)
at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
at org.apache.spark.deploy.worker.Worker$.startRpcEnvAndEndpoint(Worker.scala:859)
at org.apache.spark.deploy.worker.Worker$.main(Worker.scala:828)
at org.apache.spark.deploy.worker.Worker.main(Worker.scala)
20/07/01 21:03:42 INFO util.ShutdownHookManager: Shutdown hook called
# following is just FYI
root#b87fad6a3ffa:/usr/spark-3.0.0# /bin/sh -c "bin/spark-class org.apache.spark.deploy.worker.Worker --help"
20/07/01 21:16:10 INFO worker.Worker: Started daemon with process name: 552#b87fad6a3ffa
20/07/01 21:16:10 INFO util.SignalUtils: Registered signal handler for TERM
20/07/01 21:16:10 INFO util.SignalUtils: Registered signal handler for HUP
20/07/01 21:16:10 INFO util.SignalUtils: Registered signal handler for INT
Usage: Worker [options] <master>
Master must be a URL of the form spark://hostname:port
Options:
-c CORES, --cores CORES Number of cores to use
-m MEM, --memory MEM Amount of memory to use (e.g. 1000M, 2G)
-d DIR, --work-dir DIR Directory to run apps in (default: SPARK_HOME/work)
-i HOST, --ip IP Hostname to listen on (deprecated, please use --host or -h)
-h HOST, --host HOST Hostname to listen on
-p PORT, --port PORT Port to listen on (default: random)
--webui-port PORT Port for web UI (default: 8081)
--properties-file FILE Path to a custom Spark properties file.
Default is conf/spark-defaults.conf.
...
Master node itself works just fine, I can see its admin ui through 8080, etc.
Any ideas why Spark does not resolve the hostname but only works with an IP address?
The issue is about the _ that I used in the hostname. When I changed spark_master and spark_worker to use - instead, the problem was solved.
Relevant links:
https://bugs.java.com/bugdatabase/view_bug.do?bug_id=6587184
URI - getHost returns null. Why?
The relevant code piece from Spark codebase:
def extractHostPortFromSparkUrl(sparkUrl: String): (String, Int) = {
try {
val uri = new java.net.URI(sparkUrl)
val host = uri.getHost
val port = uri.getPort
if (uri.getScheme != "spark" ||
host == null ||
port < 0 ||
(uri.getPath != null && !uri.getPath.isEmpty) || // uri.getPath returns "" instead of null
uri.getFragment != null ||
uri.getQuery != null ||
uri.getUserInfo != null) {
throw new SparkException("Invalid master URL: " + sparkUrl)
}
(host, port)
} catch {
case e: java.net.URISyntaxException =>
throw new SparkException("Invalid master URL: " + sparkUrl, e)
}
}

Unable to start Spark's "start-all.sh" on EC2 (rhel7)

I am trying to run standalone Spark-2.1.1 by triggering /sbin/start-all.sh in an EC2 instance (RHEL 7). Whenever it runs, it asked for the root#localhost's password and even tough I've given the correct password, it throws me - root#localhost's password: localhost: Permission denied, please try again. error.
Irrespective of this error when I hit jps in the console I could see the Master is running.
root#localhost# jps
27863 Master
28093 Jps
Further I checked the logs and found this-
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/06/12 15:36:15 INFO Master: Started daemon with process name: 27863#localhost.org.xxxxxxxxx.com
17/06/12 15:36:15 INFO SignalUtils: Registered signal handler for TERM
17/06/12 15:36:15 INFO SignalUtils: Registered signal handler for HUP
17/06/12 15:36:15 INFO SignalUtils: Registered signal handler for INT
17/06/12 15:36:15 WARN Utils: Your hostname, localhost.org.xxxxxxxxx.com resolves to a loopback address: 127.0.0.1; using localhost ip instead (on interface eth0)
17/06/12 15:36:15 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
17/06/12 15:36:16 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/06/12 15:36:16 INFO SecurityManager: Changing view acls to: root
17/06/12 15:36:16 INFO SecurityManager: Changing modify acls to: root
17/06/12 15:36:16 INFO SecurityManager: Changing view acls groups to:
17/06/12 15:36:16 INFO SecurityManager: Changing modify acls groups to:
17/06/12 15:36:16 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
17/06/12 15:36:16 INFO Utils: Successfully started service 'sparkMaster' on port 7077.
17/06/12 15:36:16 INFO Master: Starting Spark master at spark://localhost.org.xxxxxxxxx.com:7077
17/06/12 15:36:16 INFO Master: Running Spark version 2.1.1
17/06/12 15:36:16 INFO Utils: Successfully started service 'MasterUI' on port 8080.
17/06/12 15:36:16 INFO MasterWebUI: Bound MasterWebUI to 0.0.0.0, and started at http://localhost:8080
17/06/12 15:36:16 INFO Utils: Successfully started service on port 6066.
17/06/12 15:36:16 INFO StandaloneRestServer: Started REST server for submitting applications on port 6066
17/06/12 15:36:16 INFO Master: I have been elected leader! New state: ALIVE
I am trying to figure out why I am unable to start my worker nodes. Could someone help me out with this ? Thanks.
Check your hostname if it is correctly resolved.
If you're using localhost, make sure it is resolved in your /etc/hosts file.
let me know if this helps. Cheers.

Setting Spark master ip #

I have a Spark workers which can't connect to its master because of an IP issue.
On the start-all.sh on the master (which name is 'pl'), I get the following on the slave log :
16/02/12 21:28:35 INFO WorkerWebUI: Started WorkerWebUI at http://192.168.0.38:8081
16/02/12 21:28:35 INFO Worker: Connecting to master pl:7077...
16/02/12 21:28:35 WARN Worker: Failed to connect to master pl:7077
java.io.IOException: Failed to connect to pl/192.168.0.39:7077
Here is my /etc/hosts file :
$ cat /etc/hosts
127.0.0.1 localhost
127.0.1.1 wk
192.168.0.39 pl
# The following lines are desirable for IPv6 capable hosts
::1 localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
It seems like spark worker is confused between the master names and IP address...How should I set up this ?
Another question is : looking at the master's logs, it seems that the master is listening on another port (7078) than the one the worker is trying to reach (7077) because of a failure to start on the 1st port tried.
romain#pl:~/spark-1.6.0-bin-hadoop2.6/logs$ cat spark-romain-org.apache.spark.deploy.master.Master-1-pl.out
Spark Command: /usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java -cp /home/romain/spark-1.6.0-bin-hadoop2.6/conf/:/home/romain/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/home/romain/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/home/romain/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/home/romain/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar -Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.master.Master --ip pl --port 7077 --webui-port 8080
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/02/12 21:28:35 INFO Master: Registered signal handlers for [TERM, HUP, INT]
16/02/12 21:28:35 WARN Utils: Your hostname, pl resolves to a loopback address: 127.0.1.1; using 192.168.0.39 instead (on interface eth0)
16/02/12 21:28:35 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
16/02/12 21:28:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/02/12 21:28:35 INFO SecurityManager: Changing view acls to: romain
16/02/12 21:28:35 INFO SecurityManager: Changing modify acls to: romain
16/02/12 21:28:35 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(romain); users with modify permissions: Set(romain)
16/02/12 21:28:36 WARN Utils: Service 'sparkMaster' could not bind on port 7077. Attempting port 7078.
16/02/12 21:28:36 INFO Utils: Successfully started service 'sparkMaster' on port 7078.
16/02/12 21:28:36 INFO Master: Starting Spark master at spark://pl:7078
16/02/12 21:28:36 INFO Master: Running Spark version 1.6.0
16/02/12 21:28:36 WARN Utils: Service 'MasterUI' could not bind on port 8080. Attempting port 8081.
16/02/12 21:28:36 WARN Utils: Service 'MasterUI' could not bind on port 8081. Attempting port 8082.
16/02/12 21:28:36 INFO Utils: Successfully started service 'MasterUI' on port 8082.
16/02/12 21:28:36 INFO MasterWebUI: Started MasterWebUI at http://192.168.0.39:8082
16/02/12 21:28:36 WARN Utils: Service could not bind on port 6066. Attempting port 6067.
16/02/12 21:28:36 INFO Utils: Successfully started service on port 6067.
16/02/12 21:28:36 INFO StandaloneRestServer: Started REST server for submitting applications on port 6067
16/02/12 21:28:36 INFO Master: I have been elected leader! New state: ALIVE
But what is strange is that the local worker logs as if successusfully connected to the local master on port :
16/02/12 21:28:38 INFO Worker: Connecting to master pl:7077...
16/02/12 21:28:38 INFO Worker: Successfully registered with master spark://pl:7077
You can try running netstat -pna | grep 7077 (needs root privileges) on the master to see what process is blocking the port.
Maybe you have another driver instance running. If this is a Java process blocking the port you can use jps to find out more about it.

Resources