Spark slave worker cannot bind to Driver on master - apache-spark

I've got a DSE cluster with 3 machines: 1, 2 and 3.
When I submit an application to the master, if I understand correctly, it works like this:
The master receives the application and allocates resources
The driver starts running on the allocated worker
The driver launches executors on the other nodes of the cluster to share the workload
So in this cluster we have the following configuration:
1 is the master and has worker 1
2 is a slave and has worker 2
3 is a slave and has worker 3
When Spark chooses worker 1 (master) for the driver, everything runs just fine.
But when Spark decides to allocate worker 2 (slave) or worker 3 (slave) to the driver, it tries to bind to the IP of the master and fails every time:
INFO 16:20:45 Changing view acls to: cassandra
INFO 16:20:45 Changing modify acls to: cassandra
INFO 16:20:45 SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(cassandra); users with modify permissions: Set(cassandra)
INFO 16:20:45 Slf4jLogger started
ERROR 16:20:46 failed to bind to /10.1.1.1:0, shutting down Netty transport
WARN 16:20:46 Service 'Driver' could not bind on port 0. Attempting port 1.
INFO 16:20:46 Slf4jLogger started
ERROR 16:20:46 failed to bind to /10.1.1.1:0, shutting down Netty transport
WARN 16:20:46 Service 'Driver' could not bind on port 0. Attempting port 1.
The configuration of each node is quite simple:
export SPARK_LOCAL_IP="10.1.1.1"   # or .2 or .3
export SPARK_PUBLIC_DNS="xx.xx.xx.xx"
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=7080
export SPARK_DRIVER_HOST="10.1.1.1"   # or .2 or .3
export SPARK_WORKER_INSTANCES=1
export SPARK_DRIVER_MEMORY="10G"
I tried to set a spark.driver.port in spark-defaults.conf but it had no effect.
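For reference, here is a minimal sketch of how the per-node settings could look so that a driver placed on worker 2 or 3 binds to that node's own address rather than the master's; the node-2 values are assumptions based on the layout above, and whether DSE honours spark.driver.port in cluster mode is also an assumption:
# spark-env.sh on node 2 (sketch; nodes 1 and 3 would use 10.1.1.1 and 10.1.1.3)
export SPARK_LOCAL_IP="10.1.1.2"        # this node's own address, not the master's
export SPARK_PUBLIC_DNS="xx.xx.xx.xx"
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=7080
export SPARK_WORKER_INSTANCES=1
export SPARK_DRIVER_MEMORY="10G"
# if SPARK_DRIVER_HOST is honoured by this setup, hard-coding it to 10.1.1.1 on every
# node would match the bind attempts on the master's IP seen in the error above
# spark-defaults.conf (sketch): pin a fixed driver port rather than port 0
# spark.driver.port    7078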
Here is the submit call:
/usr/bin/dse spark-submit --properties-file production.conf --master spark://10.1.1.1:7077 --deploy-mode cluster --class "com.company.SignalIO" aggregation.jar 2015-6-1-00:00:00 2015-6-2-00:00:00 signal_table
Any ideas?

Related

pyspark tasks stuck on Airflow and Spark Standalone Cluster with Docker-compose

I set up Airflow and a Spark standalone cluster with docker-compose.
Airflow runs spark-submit tasks in Spark client mode, which are submitted directly to the Spark master. However, when I execute a spark-submit task, it gets stuck.
Spark-submit Command:
spark-submit --verbose --master spark:7077 --name dummy_sql_spark_job ${AIRFLOW_HOME}/dags/spark/spark_sql.py
What I see in the spark-submit driver logs:
22/01/04 07:02:19 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20220104070012-0011/1 is now EXITED (Command exited with code 1)
22/01/04 07:02:19 INFO StandaloneSchedulerBackend: Executor app-20220104070012-0011/1 removed: Command exited with code 1
22/01/04 07:02:19 INFO BlockManagerMaster: Removal of executor 1 requested
22/01/04 07:02:19 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 1
22/01/04 07:02:19 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
22/01/04 07:02:19 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20220104070012-0011/5 on worker-20220104061702-172.27.0.9-38453 (172.27.0.9:38453) with 1 core(s)
22/01/04 07:02:19 INFO StandaloneSchedulerBackend: Granted executor ID app-20220104070012-0011/5 on hostPort 172.27.0.9:38453 with 1 core(s), 1024.0 MiB RAM
22/01/04 07:02:19 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20220104070012-0011/5 is now RUNNING
22/01/04 07:02:28 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
22/01/04 07:02:43 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
22/01/04 07:02:58 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
22/01/04 07:03:13 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
22/01/04 07:03:28 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
22/01/04 07:03:43 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
What I see in the logs of one of the Spark workers:
spark-worker-1_1 | 22/01/04 07:02:18 INFO SecurityManager: Changing modify acls groups to:
spark-worker-1_1 | 22/01/04 07:02:18 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark); groups with view permissions: Set(); users with modify permissions: Set(spark); groups with modify permissions: Set()
spark-worker-1_1 | 22/01/04 07:02:19 INFO ExecutorRunner: Launch command: "/opt/bitnami/java/bin/java" "-cp" "/opt/bitnami/spark/conf/:/opt/bitnami/spark/jars/*" "-Xmx1024M" "-Dspark.driver.port=5001" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler#172.27.0.6:5001" "--executor-id" "3" "--hostname" "172.27.0.11" "--cores" "1" "--app-id" "app-20220104070012-0011" "--worker-url" "spark://Worker#172.27.0.11:35093"
Versions:
Airflow image: apache/airflow:2.2.3
Spark driver version: 3.1.2
Spark server: 3.2.0
Network
All containers (airflow-scheduler, airflow-webserver, spark-master, spark-worker-n) are connected to the same external network.
The Spark driver is installed in the Airflow containers (scheduler, webserver), because the corresponding DAGs and tasks are executed by the airflow-scheduler.
UPDATE
After replacing the driver's Spark version to match the master's (3.2.0), the issue disappeared. So, in my particular case, the problem was not connectivity between the different Spark actors (driver, master, worker/executor) but a version mismatch. For some reason the Spark workers do not log a corresponding error, which is misleading.
Most of the related threads pointed to connectivity issues. In my case, however, the issue was a mismatch between the Spark driver version and the master/worker version.
After changing the driver's Spark version to match the master's (3.2.0), and also ensuring the same Python version (3.9.10) on both the driver and executor sides, the issue disappeared. So the problem was not connectivity between the Spark actors but a version mismatch, which the workers never log explicitly.
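For anyone hitting the same symptom, a quick sanity check before digging into networking is to compare the Spark and Python versions on both sides. This is only a sketch; the docker-compose service names are the ones listed above, and the availability of python and spark-submit inside each image is an assumption:
# Spark and Python versions used by the driver (inside the Airflow scheduler container)
docker-compose exec airflow-scheduler spark-submit --version
docker-compose exec airflow-scheduler python --version
# Spark and Python versions on the standalone master (Bitnami image)
docker-compose exec spark-master spark-submit --version
docker-compose exec spark-master python --version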

Can't connect slaves to master in Spark

I'm using 4 instances on Compute Engine, each running Spark set up with Cloudera Manager. I have no problems starting the master and connecting to it in my local browser, and it reports itself as spark://instance-1:7077. When I run start-slave on the remaining instances I get no errors, until I look at the log:
16/05/02 13:10:18 INFO worker.Worker: Started daemon with process name: 12612@instance-2.c.cluster1-1294.internal
16/05/02 13:10:18 INFO worker.Worker: Registered signal handlers for [TERM, HUP, INT]
16/05/02 13:10:18 INFO spark.SecurityManager: Changing view acls to: root
16/05/02 13:10:18 INFO spark.SecurityManager: Changing modify acls to: root
16/05/02 13:10:18 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with mod$
16/05/02 13:10:19 INFO util.Utils: Successfully started service 'sparkWorker' on port 60270.
16/05/02 13:10:19 INFO worker.Worker: Starting Spark worker 10.142.0.3:60270 with 2 cores, 6.3 GB RAM
16/05/02 13:10:19 INFO worker.Worker: Running Spark version 1.6.0
16/05/02 13:10:19 INFO worker.Worker: Spark home: /opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/spark
16/05/02 13:10:19 ERROR worker.Worker: Failed to create work directory /opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/spark/work
If I use mkdir to create 'work', it throws an error saying the directory already exists:
mkdir: cannot create directory ‘work’: File exists
The file does exist, and when I use ls to find it, it is highlighted in red with a black background. Any help would be appreciated.
Maybe this is a permission issue.
Try this:
$ sudo chown -R your_userName:your_groupName /opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/spark
Now change the mode of the above path:
$ sudo chmod 777 /opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/spark
Also, all the slaves must be able to SSH to each other and talk to one another.
And copy all of Spark's configuration files to the slave nodes as well.
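If loosening permissions on the Cloudera parcel directory is not desirable, an alternative sketch (not Cloudera-specific advice) is to point the worker's scratch space at a directory that is already writable, via SPARK_WORKER_DIR; the path and owner below are placeholders:
# one-time setup on each worker: create a writable work directory (example path)
sudo mkdir -p /var/lib/spark/work
sudo chown -R your_userName:your_groupName /var/lib/spark/work
# then in spark-env.sh on each worker:
export SPARK_WORKER_DIR=/var/lib/spark/work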

Spark hangs on authentication with a Docker Mesos cluster

I'm trying to simulate a multi-node Mesos cluster using Docker and Zookeeper and trying to run a simple (py)Spark job on top of it. These Docker containers and the pyspark script are all run on the same machine. However, when I execute my Spark script, it hangs at:
No credentials provided. Attempting to register without authentication
The Mesos slave constantly outputs:
I0929 14:59:32.925915 62 slave.cpp:1959] Asked to shut down framework 20150929-143802-1224741292-5050-33-0060 by master#172.17.0.73:5050
W0929 14:59:32.926035 62 slave.cpp:1974] Cannot shut down unknown framework 20150929-143802-1224741292-5050-33-0060
And the Mesos master constantly outputs:
I0929 14:38:15.169683 39 master.cpp:2094] Received SUBSCRIBE call for framework 'test' at scheduler-2f4e1e52-a04a-401f-b9aa-1253554fe73b#127.0.1.1:46693
I0929 14:38:15.169845 39 master.cpp:2164] Subscribing framework test with checkpointing disabled and capabilities [ ]
E0929 14:38:15.170361 42 socket.hpp:174] Shutdown failed on fd=15: Transport endpoint is not connected [107]
I0929 14:38:15.170409 36 hierarchical.hpp:391] Added framework 20150929-143802-1224741292-5050-33-0000
I0929 14:38:15.170534 39 master.cpp:1051] Framework 20150929-143802-1224741292-5050-33-0000 (test) at scheduler-2f4e1e52-a04a-401f-b9aa-1253554fe73b#127.0.1.1:46693 disconnected
I0929 14:38:15.170549 39 master.cpp:2370] Disconnecting framework 20150929-143802-1224741292-5050-33-0000 (test) at scheduler-2f4e1e52-a04a-401f-b9aa-1253554fe73b#127.0.1.1:46693
I0929 14:38:15.170555 39 master.cpp:2394] Deactivating framework 20150929-143802-1224741292-5050-33-0000 (test) at scheduler-2f4e1e52-a04a-401f-b9aa-1253554fe73b#127.0.1.1:46693
E0929 14:38:15.170560 42 socket.hpp:174] Shutdown failed on fd=16: Transport endpoint is not connected [107]
I0929 14:38:15.170593 39 master.cpp:1075] Giving framework 20150929-143802-1224741292-5050-33-0000 (test) at scheduler-2f4e1e52-a04a-401f-b9aa-1253554fe73b#127.0.1.1:46693 0ns to failover
W0929 14:38:15.170835 41 master.cpp:4482] Master returning resources offered to framework 20150929-143802-1224741292-5050-33-0000 because the framework has terminated or is inactive
I0929 14:38:15.170855 36 hierarchical.hpp:474] Deactivated framework 20150929-143802-1224741292-5050-33-0000
I0929 14:38:15.170990 37 hierarchical.hpp:814] Recovered cpus(*):8; mem(*):31092; disk(*):443036; ports(*):[31000-32000] (total: cpus(*):8; mem(*):31092; disk(*):443036; ports(*):[31000-32000
], allocated: ) on slave 20150929-051336-1224741292-5050-19-S0 from framework 20150929-143802-1224741292-5050-33-0000
I0929 14:38:15.171820 41 master.cpp:4469] Framework failover timeout, removing framework 20150929-143802-1224741292-5050-33-0000 (test) at scheduler-2f4e1e52-a04a-401f-b9aa-1253554fe73b#127.0
.1.1:46693
I0929 14:38:15.171835 41 master.cpp:5112] Removing framework 20150929-143802-1224741292-5050-33-0000 (test) at scheduler-2f4e1e52-a04a-401f-b9aa-1253554fe73b#127.0.1.1:46693
I0929 14:38:15.172130 41 hierarchical.hpp:428] Removed framework 20150929-143802-1224741292-5050-33-0000
The Mesos master Docker image is built with the following Dockerfile
FROM ubuntu:14.04
ENV MESOS_V 0.24.0
# update
RUN apt-get update
RUN apt-get upgrade -y
# dependencies
RUN apt-get install -y wget openjdk-7-jdk build-essential python-dev python-boto libcurl4-nss-dev libsasl2-dev maven libapr1-dev libsvn-dev
# mesos
RUN wget http://www.apache.org/dist/mesos/${MESOS_V}/mesos-${MESOS_V}.tar.gz
RUN tar -zxf mesos-*.tar.gz
RUN rm mesos-*.tar.gz
RUN mv mesos-* mesos
WORKDIR mesos
RUN mkdir build
RUN ./configure
RUN make
RUN make install
RUN ldconfig
EXPOSE 5050
ENTRYPOINT ["/bin/bash"]
And I manually execute the mesos-master command:
LIBPROCESS_IP=${MASTER_IP} mesos-master --registry=in_memory --ip=${MASTER_IP} --zk=zk://172.17.0.75:2181/mesos --advertise_ip=${MASTER_IP}
The Mesos slave Docker image is built using the same Dockerfile except port 5051 is exposed instead. Then I run the following command in its container:
LIBPROCESS_IP=172.17.0.72 mesos-slave --master=zk://172.17.0.75:2181/mesos
The pyspark script is:
import os
import pyspark
src = 'file:///{}/README.md'.format(os.environ['SPARK_HOME'])
leader_ip = '172.17.0.75'
conf = pyspark.SparkConf()
conf.setMaster('mesos://zk://{}:2181/mesos'.format(leader_ip))
conf.set('spark.executor.uri', 'http://d3kbcqa49mib13.cloudfront.net/spark-1.5.0-bin-hadoop2.6.tgz')
conf.setAppName('my_test_app')
sc = pyspark.SparkContext(conf=conf)
lines = sc.textFile(src)
words = lines.flatMap(lambda x: x.split(' '))
word_count = (words.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x+y))
print(word_count.collect())
Here is the complete output of the pyspark script:
15/09/29 11:07:59 INFO SparkContext: Running Spark version 1.5.0
15/09/29 11:07:59 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/09/29 11:07:59 WARN Utils: Your hostname, hubble resolves to a loopback address: 127.0.1.1; using 192.168.1.2 instead (on interface em1)
15/09/29 11:07:59 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/09/29 11:07:59 INFO SecurityManager: Changing view acls to: ftseng
15/09/29 11:07:59 INFO SecurityManager: Changing modify acls to: ftseng
15/09/29 11:07:59 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ftseng); users with modify permissions: Set(ftseng)
15/09/29 11:08:00 INFO Slf4jLogger: Slf4jLogger started
15/09/29 11:08:00 INFO Remoting: Starting remoting
15/09/29 11:08:00 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver#192.168.1.2:38860]
15/09/29 11:08:00 INFO Utils: Successfully started service 'sparkDriver' on port 38860.
15/09/29 11:08:00 INFO SparkEnv: Registering MapOutputTracker
15/09/29 11:08:00 INFO SparkEnv: Registering BlockManagerMaster
15/09/29 11:08:00 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-28695bd2-fc83-45f4-b0a0-eefcfb80a3b5
15/09/29 11:08:00 INFO MemoryStore: MemoryStore started with capacity 530.3 MB
15/09/29 11:08:00 INFO HttpFileServer: HTTP File server directory is /tmp/spark-89444c7a-725a-4454-87db-8873f4134580/httpd-341c3da9-16d5-43a4-93ee-0e8b47389fdb
15/09/29 11:08:00 INFO HttpServer: Starting HTTP Server
15/09/29 11:08:00 INFO Utils: Successfully started service 'HTTP file server' on port 51405.
15/09/29 11:08:00 INFO SparkEnv: Registering OutputCommitCoordinator
15/09/29 11:08:00 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/09/29 11:08:00 INFO SparkUI: Started SparkUI at http://192.168.1.2:4040
15/09/29 11:08:00 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
2015-09-29 11:08:00,651:32221(0x7fc09e17c700):ZOO_INFO#log_env#712: Client environment:zookeeper.version=zookeeper C client 3.4.5
2015-09-29 11:08:00,651:32221(0x7fc09e17c700):ZOO_INFO#log_env#716: Client environment:host.name=hubble
2015-09-29 11:08:00,651:32221(0x7fc09e17c700):ZOO_INFO#log_env#723: Client environment:os.name=Linux
2015-09-29 11:08:00,651:32221(0x7fc09e17c700):ZOO_INFO#log_env#724: Client environment:os.arch=3.19.0-25-generic
2015-09-29 11:08:00,651:32221(0x7fc09e17c700):ZOO_INFO#log_env#725: Client environment:os.version=#26-Ubuntu SMP Fri Jul 24 21:17:31 UTC 2015
2015-09-29 11:08:00,651:32221(0x7fc09e17c700):ZOO_INFO#log_env#733: Client environment:user.name=ftseng
2015-09-29 11:08:00,651:32221(0x7fc09e17c700):ZOO_INFO#log_env#741: Client environment:user.home=/home/ftseng
2015-09-29 11:08:00,651:32221(0x7fc09e17c700):ZOO_INFO#log_env#753: Client environment:user.dir=/home/ftseng
2015-09-29 11:08:00,651:32221(0x7fc09e17c700):ZOO_INFO#zookeeper_init#786: Initiating client connection, host=172.17.0.75:2181 sessionTimeout=10000 watcher=0x7fc0962b7176 sessionId=0 sessionPasswd=<null> context=0x7fc078001860 flags=0
I0929 11:08:00.651923 32328 sched.cpp:164] Version: 0.24.0
2015-09-29 11:08:00,652:32221(0x7fc06bfff700):ZOO_INFO#check_events#1703: initiated connection to server [172.17.0.75:2181]
2015-09-29 11:08:00,657:32221(0x7fc06bfff700):ZOO_INFO#check_events#1750: session establishment complete on server [172.17.0.75:2181], sessionId=0x150177fcfc40014, negotiated timeout=10000
I0929 11:08:00.658051 32322 group.cpp:331] Group process (group(1)#127.0.1.1:48692) connected to ZooKeeper
I0929 11:08:00.658083 32322 group.cpp:805] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I0929 11:08:00.658100 32322 group.cpp:403] Trying to create path '/mesos' in ZooKeeper
I0929 11:08:00.659600 32326 detector.cpp:156] Detected a new leader: (id='2')
I0929 11:08:00.659904 32325 group.cpp:674] Trying to get '/mesos/json.info_0000000002' in ZooKeeper
I0929 11:08:00.661052 32326 detector.cpp:481] A new leading master (UPID=master#172.17.0.73:5050) is detected
I0929 11:08:00.661201 32320 sched.cpp:262] New master detected at master#172.17.0.73:5050
I0929 11:08:00.661798 32320 sched.cpp:272] No credentials provided. Attempting to register without authentication
After a lot more experimentation, it looks like it was an issue with the IP address of the host machine: it was using its local network address (192.168.xx.xx) when it should have been using its Docker host IP (172.17.xx.xx).
I managed to get things running with:
LIBPROCESS_IP=172.17.xx.xx python test_spark.py
I'm now hitting a different error, but it seems unrelated, so I think this command solves my problem.
I'm not familiar enough with Mesos/Spark yet to understand why this fixes things, so if someone wants to add an explanation, that would be very helpful.
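For reference, a minimal sketch of the working invocation. The bridge address is deliberately left as 172.17.xx.xx, as above, and also exporting SPARK_LOCAL_IP is an assumption based on the "Set SPARK_LOCAL_IP if you need to bind to another address" warning in the log:
# use the Docker bridge address that the Mesos master/slave containers can reach,
# not the host's LAN address (192.168.x.x)
export LIBPROCESS_IP=172.17.xx.xx
export SPARK_LOCAL_IP=$LIBPROCESS_IP
python test_spark.py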

set the ip of the mesos master on a slave

I have three Mesos slave nodes and one master, at 10.14.56.157, 10.14.56.159, 10.14.56.160 and 10.14.56.156 respectively. The machines are named worker1, worker2, worker3 and master.
I managed to set up the Mesos cluster correctly (I believe). The web UI on 10.0.0.4:5050 shows all three slaves. Then I run a Spark shell on the cluster. Everything initially works fine: the shell starts, the UI shows a new framework started, etc. Then I try to run a simple test:
val numbers = sc.parallelize(1 to 1000000, 1000)
which works fine and then
numbers.count
Of course, this is when Spark actually does some work. It starts the tasks and sends them to the slaves (I can see this in the logs), but then none of the tasks completes (status: LOST). Spark retries each task up to 4 times and eventually gives up. I looked at the logs on the slave machines (via the sandbox link in the UI) and got the following output:
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0227 13:47:59.842319 17015 fetcher.cpp:76] Fetching URI '/home/user01/spark-1.2.1-bin-hadoop1.tgz'
I0227 13:47:59.842658 17015 fetcher.cpp:179] Copying resource from '/home/user01/spark-1.2.1-bin-hadoop1.tgz' to '/tmp/mesos/slaves/20150226-160235-2620919306-5050-14323-1/frameworks/20150227-132220-2620919306-5050-30420-0001/executors/20150226-160235-2620919306-5050-14323-1/runs/1978f267-cb47-4a6c-bd1f-69e99c00ae13'
I0227 13:48:09.896682 17015 fetcher.cpp:64] Extracted resource '/tmp/mesos/slaves/20150226-160235-2620919306-5050-14323-1/frameworks/20150227-132220-2620919306-5050-30420-0001/executors/20150226-160235-2620919306-5050-14323-1/runs/1978f267-cb47-4a6c-bd1f-69e99c00ae13/spark-1.2.1-bin-hadoop1.tgz' into '/tmp/mesos/slaves/20150226-160235-2620919306-5050-14323-1/frameworks/20150227-132220-2620919306-5050-30420-0001/executors/20150226-160235-2620919306-5050-14323-1/runs/1978f267-cb47-4a6c-bd1f-69e99c00ae13'
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/02/27 13:48:11 INFO MesosExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
I0227 13:48:11.493357 17124 exec.cpp:132] Version: 0.20.1
I0227 13:48:11.496057 17142 exec.cpp:206] Executor registered on slave 20150226-160235-2620919306-5050-14323-1
15/02/27 13:48:11 INFO MesosExecutorBackend: Registered with Mesos as executor ID 20150226-160235-2620919306-5050-14323-1 with 1 cpus
15/02/27 13:48:11 INFO Executor: Starting executor ID 20150226-160235-2620919306-5050-14323-1 on host 10.14.56.160
15/02/27 13:48:11 INFO SecurityManager: Changing view acls to: user01
15/02/27 13:48:11 INFO SecurityManager: Changing modify acls to: user01
15/02/27 13:48:11 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(user01); users with modify permissions: Set(user01)
15/02/27 13:48:12 INFO Slf4jLogger: Slf4jLogger started
15/02/27 13:48:12 INFO Remoting: Starting remoting
15/02/27 13:48:12 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutor@10.14.56.160:42869]
15/02/27 13:48:12 INFO Utils: Successfully started service 'sparkExecutor' on port 42869.
15/02/27 13:48:12 INFO AkkaUtils: Connecting to MapOutputTracker: akka.tcp://sparkDriver@master:48886/user/MapOutputTracker
15/02/27 13:48:12 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkDriver@master:48886]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: master: Name or service not known
akka.actor.ActorNotFound: Actor not found for: ActorSelection[Anchor(akka.tcp://sparkDriver@master:48886/), Path(/user/MapOutputTracker)]
at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:65)
at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:63)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58)
at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.unbatchedExecute(Future.scala:74)
at akka.dispatch.BatchingExecutor$class.execute(BatchingExecutor.scala:110)
at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.execute(Future.scala:73)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:267)
at akka.actor.EmptyLocalActorRef.specialHandle(ActorRef.scala:508)
at akka.actor.DeadLetterActorRef.specialHandle(ActorRef.scala:541)
at akka.actor.DeadLetterActorRef.$bang(ActorRef.scala:531)
at akka.remote.RemoteActorRefProvider$RemoteDeadLetterActorRef.$bang(RemoteActorRefProvider.scala:87)
at akka.remote.EndpointWriter.postStop(Endpoint.scala:561)
at akka.actor.Actor$class.aroundPostStop(Actor.scala:475)
at akka.remote.EndpointActor.aroundPostStop(Endpoint.scala:415)
at akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:210)
at akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:172)
at akka.actor.ActorCell.terminate(ActorCell.scala:369)
at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:462)
at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Exception in thread "Thread-1" I0227 13:48:12.364940 17142 exec.cpp:413] Deactivating the executor libprocess
The line where the error occurs says:
Tried to associate with unreachable remote address [akka.tcp://sparkDriver@master:48886]
It seems to me that the slave cannot resolve the name master to the master's IP. Is that correct? If so, how do I change it to the actual IP? If not, how do I fix it? Thanks!
What happens if you type ping master on one of the slave machines? If that fails, that's your problem, and you could fix it by adding a line to each slave's /etc/hosts file pointing master to the correct IP.
You could also try setting spark.driver.host to its IP when launching the spark driver, to change what "host" it advertises itself as.
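A sketch of both suggestions (the master machine's IP is taken from the question's layout; the exact spark-shell flags are assumptions about how the shell is launched):
# suggestion 1: on each slave, make the hostname "master" resolvable
echo "10.14.56.156  master" | sudo tee -a /etc/hosts
# suggestion 2: advertise the driver by IP instead of hostname when starting the shell
./bin/spark-shell --master mesos://10.14.56.156:5050 --conf spark.driver.host=10.14.56.156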

Spark driver program launching in `cluster` mode failed in a weird way

I'm new to Spark. I've encountered a problem: when I launch a program on a standalone Spark cluster from the command line:
./spark-submit --class scratch.Pi --deploy-mode cluster --executor-memory 5g --name pi --driver-memory 5g --driver-java-options "-XX:MaxPermSize=1024m" --master spark://bx-42-68:7077 hdfs://bx-42-68:9000/jars/pi.jar
It throws the following error:
15/01/28 19:48:51 INFO Slf4jLogger: Slf4jLogger started
15/01/28 19:48:51 INFO Utils: Successfully started service 'driverClient' on port 59290.
Sending launch command to spark://bx-42-68:7077
Driver successfully submitted as driver-20150128194852-0003
... waiting before polling master for driver state
... polling master for driver state
State of driver-20150128194852-0003 is FAILED
The master of the cluster outputs the following log:
15/01/28 19:48:52 INFO Master: Driver submitted org.apache.spark.deploy.worker.DriverWrapper
15/01/28 19:48:52 INFO Master: Launching driver driver-20150128194852-0003 on worker worker-20150126133948-bx-42-151-26286
15/01/28 19:48:55 INFO Master: Removing driver: driver-20150128194852-0003
15/01/28 19:48:57 INFO Master: akka.tcp://driverClient@bx-42-68:59290 got disassociated, removing it.
15/01/28 19:48:57 INFO Master: akka.tcp://driverClient@bx-42-68:59290 got disassociated, removing it.
15/01/28 19:48:57 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://driverClient@bx-42-68:59290] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
15/01/28 19:48:57 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.16.42.68%3A48091-16#-1393479428] was not delivered. [9] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
And the corresponding worker that launched the driver program outputs:
15/01/28 19:48:52 INFO Worker: Asked to launch driver driver-20150128194852-0003
15/01/28 19:48:52 INFO DriverRunner: Copying user jar hdfs://bx-42-68:9000/jars/pi.jar to /data11/spark-1.2.0-bin-hadoop2.4/work/driver-20150128194852-0003/pi.jar
Spark assembly has been built with Hive, including Datanucleus jars on classpath
15/01/28 19:48:55 INFO DriverRunner: Launch Command: "/opt/apps/jdk-1.7.0_60/bin/java" "-cp" "/data11/spark-1.2.0-bin-hadoop2.4/work/driver-20150128194852-0003/pi.jar:::/data11/spark-1.2.0-bin-hadoop2.4/sbin/../conf:/data11/spark-1.2.0-bin-hadoop2.4/lib/spark-assembly-1.2.0-hadoop2.4.0.jar:/data11/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar:/data11/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/data11/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar" "-XX:MaxPermSize=128m" "-Dspark.executor.memory=5g" "-Dspark.akka.askTimeout=10" "-Dspark.rdd.compress=true" "-Dspark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" "-Dspark.serializer=org.apache.spark.serializer.KryoSerializer" "-Dspark.app.name=YANL" "-Dspark.driver.extraJavaOptions=-XX:MaxPermSize=1024m" "-Dspark.jars=hdfs://bx-42-68:9000/jars/pi.jar" "-Dspark.master=spark://bx-42-68:7077" "-Dspark.storage.memoryFraction=0.6" "-Dakka.loglevel=WARNING" "-XX:MaxPermSize=1024m" "-Xms5120M" "-Xmx5120M" "org.apache.spark.deploy.worker.DriverWrapper" "akka.tcp://sparkWorker#bx-42-151:26286/user/Worker" "scratch.Pi"
15/01/28 19:48:55 WARN Worker: Driver driver-20150128194852-0003 exited with failure
My spark-env.sh is:
export SCALA_HOME=/opt/apps/scala-2.11.5
export JAVA_HOME=/opt/apps/jdk-1.7.0_60
export SPARK_HOME=/data11/spark-1.2.0-bin-hadoop2.4
export PATH=$JAVA_HOME/bin:$PATH
export SPARK_MASTER_IP=`hostname -f`
export SPARK_LOCAL_IP=`hostname -f`
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=10.16.42.68:2181,10.16.42.134:2181,10.16.42.151:2181,10.16.42.150:2181,10.16.42.125:2181 -Dspark.deploy.zookeeper.dir=/spark"
SPARK_WORKER_MEMORY=43g
SPARK_WORKER_CORES=22
And my spark-defaults.conf is:
spark.executor.extraJavaOptions -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
spark.executor.memory 20g
spark.rdd.compress true
spark.storage.memoryFraction 0.6
spark.serializer org.apache.spark.serializer.KryoSerializer
However, when I launch the program in client mode with the following command, it works fine.
./spark-submit --class scratch.Pi --deploy-mode client --executor-memory 5g --name pi --driver-memory 5g --driver-java-options "-XX:MaxPermSize=1024m" --master spark://bx-42-68:7077 /data11/pi.jar
The reason it works in "client" mode and not in "cluster" mode is that there is no support for "cluster" mode in a standalone cluster (as mentioned in the Spark documentation).
Alternatively, if your application is submitted from a machine far from the worker machines (e.g. locally on your laptop), it is common to use cluster mode to minimize network latency between the drivers and the executors.
Note that cluster mode is currently not supported for standalone
clusters, Mesos clusters, or python applications.
If you look at the "Submitting Applications" section of the Spark documentation, it is clearly mentioned that support for cluster mode is not available in standalone clusters.
Reference link : http://spark.apache.org/docs/1.2.0/submitting-applications.html
Go to the above link and have a look at the "Launching Applications with spark-submit" section.
I think it will help. Thanks.
