Can't connect slaves to master in Spark

Can't connect slaves to master in Spark - apache-spark

Using 4 instances on Compute Engine, each running spark set up with Cloudera Manager. I have no problems starting the master and connecting in my local browser, and it connects as spark://instance-1:7077. When I start the start-slave on the remaining instances I get no errors, until I look in the log:
16/05/02 13:10:18 INFO worker.Worker: Started daemon with process name: 12612#instance-2.c.cluster1-1294.internal
16/05/02 13:10:18 INFO worker.Worker: Registered signal handlers for [TERM, HUP, INT]
16/05/02 13:10:18 INFO spark.SecurityManager: Changing view acls to: root
16/05/02 13:10:18 INFO spark.SecurityManager: Changing modify acls to: root
16/05/02 13:10:18 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with mod$
16/05/02 13:10:19 INFO util.Utils: Successfully started service 'sparkWorker' on port 60270.
16/05/02 13:10:19 INFO worker.Worker: Starting Spark worker 10.142.0.3:60270 with 2 cores, 6.3 GB RAM
16/05/02 13:10:19 INFO worker.Worker: Running Spark version 1.6.0
16/05/02 13:10:19 INFO worker.Worker: Spark home: /opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/spark
16/05/02 13:10:19 ERROR worker.Worker: Failed to create work directory /opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/spark/work
If i use mkdir to create 'work' then it throws and error and says the directory already exists:
mkdir: cannot create directory ‘work’: File exists
The file does exist and when using ls to find it it is highlighted in red with a black background. Any help would be appreciated.

Maybe this is the permission issue,
Try this,
$sudo chown -R your_userName:your_groupName /opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/spark
Now change the Mode of the above path
$sudo chmod 777 /opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/lib/spark
Also all the slaves must have ssh to each other and can able to talk one another.
And Copy all the Configuration file of spark to the slave nodes also.

Related

It's possible to configure the Beam portable runner with the spark configurations?

TLDR;
It's possible to configure the Beam portable runner with the spark configurations? More precisely, it's possible to configure the spark.driver.host in the Portable Runner?
Motivation
Currently, we have airflow implemented in a Kubernetes cluster, and aiming to use TensorFlow Extended we need to use Apache beam. For our use case Spark would be the appropriate runner to be used, and as airflow and TensorFlow are coded in python we would need to use the Apache Beam's Portable Runner (https://beam.apache.org/documentation/runners/spark/#portability).
The problem
The portable runner creates the spark context inside its container and does not leave space for the driver DNS configuration making the executors inside the worker pods non-communicable to the driver (the job server).
Setup
Following the beam documentation, the job serer was implemented in the same pod as the airflow to use the local network between these two containers.
Job server config:
- name: beam-spark-job-server
image: apache/beam_spark_job_server:2.27.0
args: ["--spark-master-url=spark://spark-master:7077"]
Job server/airflow service:
apiVersion: v1
kind: Service
metadata:
name: airflow-scheduler
labels:
app: airflow-k8s
spec:
type: ClusterIP
selector:
app: airflow-scheduler
ports:
- port: 8793
protocol: TCP
targetPort: 8793
name: scheduler
- port: 8099
protocol: TCP
targetPort: 8099
name: job-server
- port: 7077
protocol: TCP
targetPort: 7077
name: spark-master
- port: 8098
protocol: TCP
targetPort: 8098
name: artifact
- port: 8097
protocol: TCP
targetPort: 8097
name: java-expansion
The ports 8097,8098 and 8099 are related to the job server, 8793 to airflow, and 7077 to the spark master.
Development/Errors
When testing a simple beam example python -m apache_beam.examples.wordcount --output ./data_test/ --runner=PortableRunner --job_endpoint=localhost:8099 --environment_type=LOOPBACK from the airflow container I get the following response on the airflow pod:
Defaulting container name to airflow-scheduler.
Use 'kubectl describe pod/airflow-scheduler-local-f685b5bc7-9d7r6 -n airflow-main-local' to see all of the containers in this pod.
airflow#airflow-scheduler-local-f685b5bc7-9d7r6:/opt/airflow$ python -m apache_beam.examples.wordcount --output ./data_test/ --runner=PortableRunner --job_endpoint=localhost:8099 --environment_type=LOOPBACK
INFO:apache_beam.internal.gcp.auth:Setting socket default timeout to 60 seconds.
INFO:apache_beam.internal.gcp.auth:socket default timeout is 60.0 seconds.
INFO:oauth2client.client:Timeout attempting to reach GCE metadata service.
WARNING:apache_beam.internal.gcp.auth:Unable to find default credentials to use: The Application Default Credentials are not available. They are available if running in Google Compute Engine. Otherwise, the environment variable GOOGLE_APPLICATION_CREDENTIALS must be defined pointing to a file defining the credentials. See https://developers.google.com/accounts/docs/application-default-credentials for more information.
Connecting anonymously.
INFO:apache_beam.runners.worker.worker_pool_main:Listening for workers at localhost:35837
WARNING:root:Make sure that locally built Python SDK docker image has Python 3.7 interpreter.
INFO:root:Default Python SDK image for environment is apache/beam_python3.7_sdk:2.27.0
INFO:apache_beam.runners.portability.portable_runner:Environment "LOOPBACK" has started a component necessary for the execution. Be sure to run the pipeline using
with Pipeline() as p:
p.apply(..)
This ensures that the pipeline finishes before this program exits.
INFO:apache_beam.runners.portability.portable_runner:Job state changed to STOPPED
INFO:apache_beam.runners.portability.portable_runner:Job state changed to STARTING
INFO:apache_beam.runners.portability.portable_runner:Job state changed to RUNNING
And the worker log:
21/02/19 19:50:00 INFO Worker: Asked to launch executor app-20210219194804-0000/47 for BeamApp-root-0219194747-7d7938cf_51452c51-dffe-4c61-bcb7-60c7779e3256
21/02/19 19:50:00 INFO SecurityManager: Changing view acls to: root
21/02/19 19:50:00 INFO SecurityManager: Changing modify acls to: root
21/02/19 19:50:00 INFO SecurityManager: Changing view acls groups to:
21/02/19 19:50:00 INFO SecurityManager: Changing modify acls groups to:
21/02/19 19:50:00 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
21/02/19 19:50:00 INFO ExecutorRunner: Launch command: "/usr/local/openjdk-8/bin/java" "-cp" "/opt/spark/conf/:/opt/spark/jars/*" "-Xmx1024M" "-Dspark.driver.port=44447" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler#airflow-scheduler-local-f685b5bc7-9d7r6:44447" "--executor-id" "47" "--hostname" "172.18.0.3" "--cores" "1" "--app-id" "app-20210219194804-0000" "--worker-url" "spark://Worker#172.18.0.3:35837"
21/02/19 19:50:02 INFO Worker: Executor app-20210219194804-0000/47 finished with state EXITED message Command exited with code 1 exitStatus 1
21/02/19 19:50:02 INFO ExternalShuffleBlockResolver: Clean up non-shuffle files associated with the finished executor 47
21/02/19 19:50:02 INFO ExternalShuffleBlockResolver: Executor is not registered (appId=app-20210219194804-0000, execId=47)
21/02/19 19:50:02 INFO Worker: Asked to launch executor app-20210219194804-0000/48 for BeamApp-root-0219194747-7d7938cf_51452c51-dffe-4c61-bcb7-60c7779e3256
21/02/19 19:50:02 INFO SecurityManager: Changing view acls to: root
21/02/19 19:50:02 INFO SecurityManager: Changing modify acls to: root
21/02/19 19:50:02 INFO SecurityManager: Changing view acls groups to:
21/02/19 19:50:02 INFO SecurityManager: Changing modify acls groups to:
21/02/19 19:50:02 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
21/02/19 19:50:02 INFO ExecutorRunner: Launch command: "/usr/local/openjdk-8/bin/java" "-cp" "/opt/spark/conf/:/opt/spark/jars/*" "-Xmx1024M" "-Dspark.driver.port=44447" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler#airflow-scheduler-local-f685b5bc7-9d7r6:44447" "--executor-id" "48" "--hostname" "172.18.0.3" "--cores" "1" "--app-id" "app-20210219194804-0000" "--worker-url" "spark://Worker#172.18.0.3:35837"
21/02/19 19:50:04 INFO Worker: Executor app-20210219194804-0000/48 finished with state EXITED message Command exited with code 1 exitStatus 1
21/02/19 19:50:04 INFO ExternalShuffleBlockResolver: Clean up non-shuffle files associated with the finished executor 48
21/02/19 19:50:04 INFO ExternalShuffleBlockResolver: Executor is not registered (appId=app-20210219194804-0000, execId=48)
21/02/19 19:50:04 INFO Worker: Asked to launch executor app-20210219194804-0000/49 for BeamApp-root-0219194747-7d7938cf_51452c51-dffe-4c61-bcb7-60c7779e3256
21/02/19 19:50:04 INFO SecurityManager: Changing view acls to: root
21/02/19 19:50:04 INFO SecurityManager: Changing modify acls to: root
21/02/19 19:50:04 INFO SecurityManager: Changing view acls groups to:
21/02/19 19:50:04 INFO SecurityManager: Changing modify acls groups to:
21/02/19 19:50:04 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
21/02/19 19:50:04 INFO ExecutorRunner: Launch command: "/usr/local/openjdk-8/bin/java" "-cp" "/opt/spark/conf/:/opt/spark/jars/*" "-Xmx1024M" "-Dspark.driver.port=44447" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler#airflow-scheduler-local-f685b5bc7-9d7r6:44447" "--executor-id" "49" "--hostname" "172.18.0.3" "--cores" "1" "--app-id" "app-20210219194804-0000" "--worker-url" "spark://Worker#172.18.0.3:35837"
.
.
.
As we can see, the executor is being exited constantly, and by what I know this issue is created by the missing communication between the executor and the driver (the job server in this case). Also, the "--driver-url" is translated to the driver pod name using the random port "-Dspark.driver.port".
As we can't define the name of the service, the worker tries to use the original name from the driver and to use a randomly generated port. As the configuration comes from the driver, changing the default conf files in the worker/master doesn't create any results.
Using this answer as an example, I tried to use the env variable SPARK_PUBLIC_DNS in the job server but this didn't result in any changes in the worker logs.
Obs
Using directly in kubernetes a spark job
kubectl run spark-base --rm -it --labels="app=spark-client" --image bde2020/spark-base:2.4.5-hadoop2.7 -- bash ./spark/bin/pyspark --master spark://spark-master:7077 --conf spark.driver.host=spark-client
having the service:
apiVersion: v1
kind: Service
metadata:
name: spark-client
spec:
selector:
app: spark-client
clusterIP: None
I get a full working pyspark shell. If I omit the --conf parameter I get the same behavior as the first setup (exiting executors indefinitely)
21/02/19 20:21:02 INFO Worker: Executor app-20210219202050-0002/4 finished with state EXITED message Command exited with code 1 exitStatus 1
21/02/19 20:21:02 INFO ExternalShuffleBlockResolver: Clean up non-shuffle files associated with the finished executor 4
21/02/19 20:21:02 INFO ExternalShuffleBlockResolver: Executor is not registered (appId=app-20210219202050-0002, execId=4)
21/02/19 20:21:02 INFO Worker: Asked to launch executor app-20210219202050-0002/5 for Spark shell
21/02/19 20:21:02 INFO SecurityManager: Changing view acls to: root
21/02/19 20:21:02 INFO SecurityManager: Changing modify acls to: root
21/02/19 20:21:02 INFO SecurityManager: Changing view acls groups to:
21/02/19 20:21:02 INFO SecurityManager: Changing modify acls groups to:
21/02/19 20:21:02 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
21/02/19 20:21:02 INFO ExecutorRunner: Launch command: "/usr/local/openjdk-8/bin/java" "-cp" "/opt/spark/conf/:/opt/spark/jars/*" "-Xmx1024M" "-Dspark.driver.port=46161" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler#spark-base:46161" "--executor-id" "5" "--hostname" "172.18.0.20" "--cores" "1" "--app-id" "app-20210219202050-0002" "--worker-url" "spark://Worker#172.18.0.20:45151"

I have three solutions to choose from depending on your deployment requirements. In order of difficulty:
Use the Spark "uber jar" job server. This starts an embedded job server inside the Spark master, instead of using a standalone job server in a container. This would simplify your deployment a lot, since you would not need to start the beam_spark_job_server container at all.
python -m apache_beam.examples.wordcount \
--output ./data_test/ \
--runner=SparkRunner \
--spark_submit_uber_jar \
--spark_master_url=spark://spark-master:7077 \
--environment_type=LOOPBACK
You can pass the properties through a Spark configuration file. Create the Spark configuration file, and add spark.driver.host and whatever other properties you need. In the docker run command for the job server, mount that configuration file to the container, and set the SPARK_CONF_DIR environment variable to point to that directory.
If that neither of those work for you, you can alternatively build your own customized version of the job server container. Pull Beam source from Github. Check out the release branch you want to use (e.g. git checkout origin/release-2.28.0). Modify the entrypoint spark-job-server.sh and set -Dspark.driver.host=x there. Then build the container using ./gradlew :runners:spark:job-server:container:docker -Pdocker-repository-root="your-repo" -Pdocker-tag="your-tag".

Let me revise the answer. The Job server need to able to communicate with the workers vice verse. The error of keep exiting is due to this. You need to configure such that they can communicate. A k8s headless service able to solve this.
reference of workable example at https://github.com/cometta/python-apache-beam-spark . If it is useful for you, can help me to 'Star' the repository

spark-submit: unable to get driver status

I'm running a job on a test Spark standalone in cluster mode, but I'm finding myself unable to monitor the status of the driver.
Here is a minimal example using spark-2.4.3 (master and one worker running on the same node, started running sbin/start-all.sh on a freshly unarchived installation using the default conf, no conf/slaves set), executing spark-submit from the node itself:
$ spark-submit --master spark://ip-172-31-15-245:7077 --deploy-mode cluster \
--class org.apache.spark.examples.SparkPi \
/home/ubuntu/spark/examples/jars/spark-examples_2.11-2.4.3.jar 100
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.NativeCodeLoader).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
19/06/27 09:08:28 INFO SecurityManager: Changing view acls to: ubuntu
19/06/27 09:08:28 INFO SecurityManager: Changing modify acls to: ubuntu
19/06/27 09:08:28 INFO SecurityManager: Changing view acls groups to:
19/06/27 09:08:28 INFO SecurityManager: Changing modify acls groups to:
19/06/27 09:08:28 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ubuntu); groups with view permissions: Set(); users with modify permissions: Set(ubuntu); groups with modify permissions: Set()
19/06/27 09:08:28 INFO Utils: Successfully started service 'driverClient' on port 36067.
19/06/27 09:08:28 INFO TransportClientFactory: Successfully created connection to ip-172-31-15-245/172.31.15.245:7077 after 29 ms (0 ms spent in bootstraps)
19/06/27 09:08:28 INFO ClientEndpoint: Driver successfully submitted as driver-20190627090828-0008
19/06/27 09:08:28 INFO ClientEndpoint: ... waiting before polling master for driver state
19/06/27 09:08:33 INFO ClientEndpoint: ... polling master for driver state
19/06/27 09:08:33 INFO ClientEndpoint: State of driver-20190627090828-0008 is RUNNING
19/06/27 09:08:33 INFO ClientEndpoint: Driver running on 172.31.15.245:41057 (worker-20190627083412-172.31.15.245-41057)
19/06/27 09:08:33 INFO ShutdownHookManager: Shutdown hook called
19/06/27 09:08:33 INFO ShutdownHookManager: Deleting directory /tmp/spark-34082661-f0de-4c56-92b7-648ea24fa59c
> spark-submit --master spark://ip-172-31-15-245:7077 --status driver-20190627090828-0008
19/06/27 09:09:27 WARN RestSubmissionClient: Unable to connect to server spark://ip-172-31-15-245:7077.
Exception in thread "main" org.apache.spark.deploy.rest.SubmitRestConnectionException: Unable to connect to server
at org.apache.spark.deploy.rest.RestSubmissionClient$$anonfun$requestSubmissionStatus$3.apply(RestSubmissionClient.scala:165)
at org.apache.spark.deploy.rest.RestSubmissionClient$$anonfun$requestSubmissionStatus$3.apply(RestSubmissionClient.scala:148)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at org.apache.spark.deploy.rest.RestSubmissionClient.requestSubmissionStatus(RestSubmissionClient.scala:148)
at org.apache.spark.deploy.SparkSubmit.requestStatus(SparkSubmit.scala:111)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:88)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.spark.deploy.rest.SubmitRestConnectionException: No response from server
at org.apache.spark.deploy.rest.RestSubmissionClient.readResponse(RestSubmissionClient.scala:285)
at org.apache.spark.deploy.rest.RestSubmissionClient.org$apache$spark$deploy$rest$RestSubmissionClient$$get(RestSubmissionClient.scala:195)
at org.apache.spark.deploy.rest.RestSubmissionClient$$anonfun$requestSubmissionStatus$3.apply(RestSubmissionClient.scala:152)
... 11 more
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:190)
at org.apache.spark.deploy.rest.RestSubmissionClient.readResponse(RestSubmissionClient.scala:278)
... 13 more
Spark is in good health (I'm able to run other jobs after the one above), the driver-20190627090828-0008 appears as "FINISHED" in the web UI.
Is there something I am missing?
UPDATE:
on the master log all I get is
19/07/01 09:40:24 INFO master.Master: 172.31.15.245:42308 got disassociated, removing it.

Spark Cluster, failed to connect to master. (WARN Worker: Failed to connect to master)

I have a spark cluster with 2 nodes, master(172.17.0.229) and slave(172.17.0.228). I have edited spark-env.sh, added SPARK_MASTER_IP=127.17.0.229 and slaves, added 172.17.0.228.
I am starting my master node using start-master.sh and slave node using start-slaves.sh.
I can see the webUI with a master node with no worker, but the log of worker node is as:
Spark Command: /usr/lib/jvm/java-7-oracle/jre/bin/java -cp /usr/local/src/spark-1.5.2-bin-hadoop2.6/sbin/../conf/:/usr/local/src/spark-1.5.2-bin-hadoop$
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/12/18 14:17:25 INFO Worker: Registered signal handlers for [TERM, HUP, INT]
15/12/18 14:17:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/12/18 14:17:26 INFO SecurityManager: Changing view acls to: ujjwal
15/12/18 14:17:26 INFO SecurityManager: Changing modify acls to: ujjwal
15/12/18 14:17:26 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ujjwal); users wit$
15/12/18 14:17:27 INFO Slf4jLogger: Slf4jLogger started
15/12/18 14:17:27 INFO Remoting: Starting remoting
15/12/18 14:17:27 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkWorker#172.17.0.228:47599]
15/12/18 14:17:27 INFO Utils: Successfully started service 'sparkWorker' on port 47599.
15/12/18 14:17:27 INFO Worker: Starting Spark worker 172.17.0.228:47599 with 2 cores, 2.7 GB RAM
15/12/18 14:17:27 INFO Worker: Running Spark version 1.5.2
15/12/18 14:17:27 INFO Worker: Spark home: /usr/local/src/spark-1.5.2-bin-hadoop2.6
15/12/18 14:17:27 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
15/12/18 14:17:27 INFO WorkerWebUI: Started WorkerWebUI at http://172.17.0.228:8081
15/12/18 14:17:27 INFO Worker: Connecting to master 127.17.0.229:7077...
15/12/18 14:17:27 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkMaster#127.17.0.229:7077] has failed, address is now$
15/12/18 14:17:27 WARN Worker: Failed to connect to master 127.17.0.229:7077
akka.actor.ActorNotFound: Actor not found for: ActorSelection[Anchor(akka.tcp://sparkMaster#127.17.0.229:7077/), Path(/user/Master)]
at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:65)
at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:63)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:73)
at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.unbatchedExecute(Future.scala:74)
at akka.dispatch.BatchingExecutor$class.execute(BatchingExecutor.scala:120)
at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.execute(Future.scala:73)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:266)
at akka.actor.EmptyLocalActorRef.specialHandle(ActorRef.scala:533)
at akka.actor.DeadLetterActorRef.specialHandle(ActorRef.scala:569)
at akka.actor.DeadLetterActorRef.$bang(ActorRef.scala:559)
at akka.remote.RemoteActorRefProvider$RemoteDeadLetterActorRef.$bang(RemoteActorRefProvider.scala:87)
at akka.remote.EndpointWriter.postStop(Endpoint.scala:557)
at akka.actor.Actor$class.aroundPostStop(Actor.scala:477)
at akka.remote.EndpointActor.aroundPostStop(Endpoint.scala:411)
at akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:210)
at akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:172)
at akka.actor.ActorCell.terminate(ActorCell.scala:369)
at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:462)
Thanks for your suggestion.

Generally, checking the IP that your worker is trying to connect to against the reported spark://...:7077 address on the web UI at 172.17.0.229 port 8080 will help identify whether the address is correct.
In this particular case, it looks like you have a typo; change
SPARK_MASTER_IP=127.17.0.229
to read:
SPARK_MASTER_IP=172.17.0.229
(you seem to have 127/172 inverted).

My issue was a version mismatch between the spark java library I was using (2.0.0) and the version of the spark cluster (2.2.1)

set the ip of the mesos master on a slave

I have three mesos slave nodes and one master on 10.14.56.157, 10.14.56.159 and 10.14.56.160 and 10.14.56.156 respectively. The names of the machines are worker1, worker2, worker3 and master.
I managed to set up the mesos cluster correctly (I believe). The web UI on 10.0.0.4:5050 shows all the three slaves. Then I'm running a spark shell on the cluster. Everything initially works fine: shell starts, UI shows a new framework started etc. Then I'm trying to run a simple test:
val numbers = sc.parallelize(1 to 1000000, 1000)
which works fine and then
numbers.count
Of course this is when spark actually does some work. So it starts the tasks, sends it to slaves (I can see it in the logs) but then none of the tasks completes (status: LOST). Spark retries the tasks up to 4 times and eventually gives up. I looked into the logs on the slave machines (the sandbox link in the UI) and I get the following output:
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0227 13:47:59.842319 17015 fetcher.cpp:76] Fetching URI '/home/user01/spark-1.2.1-bin-hadoop1.tgz'
I0227 13:47:59.842658 17015 fetcher.cpp:179] Copying resource from '/home/user01/spark-1.2.1-bin-hadoop1.tgz' to '/tmp/mesos/slaves/20150226-160235-2620919306-5050-14323-1/frameworks/20150227-132220-2620919306-5050-30420-0001/executors/20150226-160235-2620919306-5050-14323-1/runs/1978f267-cb47-4a6c-bd1f-69e99c00ae13'
I0227 13:48:09.896682 17015 fetcher.cpp:64] Extracted resource '/tmp/mesos/slaves/20150226-160235-2620919306-5050-14323-1/frameworks/20150227-132220-2620919306-5050-30420-0001/executors/20150226-160235-2620919306-5050-14323-1/runs/1978f267-cb47-4a6c-bd1f-69e99c00ae13/spark-1.2.1-bin-hadoop1.tgz' into '/tmp/mesos/slaves/20150226-160235-2620919306-5050-14323-1/frameworks/20150227-132220-2620919306-5050-30420-0001/executors/20150226-160235-2620919306-5050-14323-1/runs/1978f267-cb47-4a6c-bd1f-69e99c00ae13'
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/02/27 13:48:11 INFO MesosExecutorBackend: Registered signal handlers for [TERM, HUP, INT]
I0227 13:48:11.493357 17124 exec.cpp:132] Version: 0.20.1
I0227 13:48:11.496057 17142 exec.cpp:206] Executor registered on slave 20150226-160235-2620919306-5050-14323-1
15/02/27 13:48:11 INFO MesosExecutorBackend: Registered with Mesos as executor ID 20150226-160235-2620919306-5050-14323-1 with 1 cpus
15/02/27 13:48:11 INFO Executor: Starting executor ID 20150226-160235-2620919306-5050-14323-1 on host 10.14.56.160
15/02/27 13:48:11 INFO SecurityManager: Changing view acls to: user01
15/02/27 13:48:11 INFO SecurityManager: Changing modify acls to: user01
15/02/27 13:48:11 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(user01); users with modify permissions: Set(user01)
15/02/27 13:48:12 INFO Slf4jLogger: Slf4jLogger started
15/02/27 13:48:12 INFO Remoting: Starting remoting
15/02/27 13:48:12 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutor#10.14.56.160:42869]
15/02/27 13:48:12 INFO Utils: Successfully started service 'sparkExecutor' on port 42869.
15/02/27 13:48:12 INFO AkkaUtils: Connecting to MapOutputTracker: akka.tcp://sparkDriver#master:48886/user/MapOutputTracker
15/02/27 13:48:12 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://sparkDriver#master:48886]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: master: Name or service not known
akka.actor.ActorNotFound: Actor not found for: ActorSelection[Anchor(akka.tcp://sparkDriver#master:48886/), Path(/user/MapOutputTracker)]
at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:65)
at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:63)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58)
at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.unbatchedExecute(Future.scala:74)
at akka.dispatch.BatchingExecutor$class.execute(BatchingExecutor.scala:110)
at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.execute(Future.scala:73)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:267)
at akka.actor.EmptyLocalActorRef.specialHandle(ActorRef.scala:508)
at akka.actor.DeadLetterActorRef.specialHandle(ActorRef.scala:541)
at akka.actor.DeadLetterActorRef.$bang(ActorRef.scala:531)
at akka.remote.RemoteActorRefProvider$RemoteDeadLetterActorRef.$bang(RemoteActorRefProvider.scala:87)
at akka.remote.EndpointWriter.postStop(Endpoint.scala:561)
at akka.actor.Actor$class.aroundPostStop(Actor.scala:475)
at akka.remote.EndpointActor.aroundPostStop(Endpoint.scala:415)
at akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:210)
at akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:172)
at akka.actor.ActorCell.terminate(ActorCell.scala:369)
at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:462)
at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Exception in thread "Thread-1" I0227 13:48:12.364940 17142 exec.cpp:413] Deactivating the executor libprocess
The line where the error occurs says:
Tried to associate with unreachable remote address [akka.tcp://sparkDriver#master:48886]
It seems to me that the slave cannot resolve the name master to the master's IP. Is that correct? If so how to change it to the actual IP. If not, how to fix it? Thanks!

What happens if you type ping master on one of the slave machines? If that fails, that's your problem, and you could fix it by adding a line to each slave's /etc/hosts file pointing master to the correct IP.
You could also try setting spark.driver.host to its IP when launching the spark driver, to change what "host" it advertises itself as.

SPARK + Standalone Cluster: Cannot start worker from another machine

I've been setting up a Spark standalone cluster setup following this link. I have 2 machines; The first one (ubuntu0) serve as both the master and a worker, and the second one (ubuntu1) is just a worker. Password-less ssh has been properly configured for both machines already and was tested by doing SSH manually on both sides.
Now when I tried to ./start-all.ssh, both master and worker on the master machine (ubuntu0) were started properly. This is signified by (1)WebUI being accessible (localhost:8081 on my part) and (2) Worker registered/displayed on the WebUI.
However, the other worker on the second machine (ubuntu1), was not started. The error displayed was:
ubuntu1: ssh: connect to host ubuntu1 port 22: Connection timed out
Now this is quite weird already given that I've properly configured the ssh to be password-less on both sides. Given this, I accessed the second machine and tried to start the worker manually using these commands:
./spark-class org.apache.spark.deploy.worker.Worker spark://ubuntu0:7707
./spark-class org.apache.spark.deploy.worker.Worker spark://<ip>:7707
However, below is the result:
14/05/23 13:49:08 INFO Utils: Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties
14/05/23 13:49:08 WARN Utils: Your hostname, ubuntu1 resolves to a loopback address:
127.0.1.1; using 192.168.122.1 instead (on interface virbr0)
14/05/23 13:49:08 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
14/05/23 13:49:09 INFO Slf4jLogger: Slf4jLogger started
14/05/23 13:49:09 INFO Remoting: Starting remoting
14/05/23 13:49:09 INFO Remoting: Remoting started; listening on addresses :
[akka.tcp://sparkWorker#ubuntu1.local:42739]
14/05/23 13:49:09 INFO Worker: Starting Spark worker ubuntu1.local:42739 with 8 cores,
4.8 GB RAM
14/05/23 13:49:09 INFO Worker: Spark home: /home/ubuntu1/jaysonp/spark/spark-0.9.1
14/05/23 13:49:09 INFO WorkerWebUI: Started Worker web UI at http://ubuntu1.local:8081
14/05/23 13:49:09 INFO Worker: Connecting to master spark://ubuntu0:7707...
14/05/23 13:49:29 INFO Worker: Connecting to master spark://ubuntu0:7707...
14/05/23 13:49:49 INFO Worker: Connecting to master spark://ubuntu0:7707...
14/05/23 13:50:09 ERROR Worker: All masters are unresponsive! Giving up.
Below are the contents of my master and slave\worker spark-env.ssh:
SPARK_MASTER_IP=192.168.3.222
STANDALONE_SPARK_MASTER_HOST=`hostname -f`
How should I resolve this? Thanks in advance!

For those who are still encountering error(s) when it comes to starting workers on different machines, I just want to share that using IP addresses in conf/slaves worked for me.
Hope this helps!

I have add similar issues today running spark 1.5.1 on RHEL 6.7.
I have 2 machines, their hostname being
- master.domain.com
- slave.domain.com
I installed a standalone version of spark (pre-build against hadoop 2.6) and installed my Oracle jdk-8u66.
Spark download:
wget http://d3kbcqa49mib13.cloudfront.net/spark-1.5.1-bin-hadoop2.6.tgz
Java download
wget --no-cookies --no-check-certificate --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com%2F; oraclelicense=accept-securebackup-cookie" "http://download.oracle.com/otn-pub/java/jdk/8u66-b17/jdk-8u66-linux-x64.tar.gz"
after spark and java are unpacked in my home directory I did the following:
on 'master.domain.com' I ran:
./sbin/start-master.sh
The webUI become available at http://master.domain.com:8080 (no slave running)
on 'slave.domain.com' I did try:
./sbin/start-slave.sh spark://master.domain.com:7077 FAILED AS FOLLOW
Spark Command: /root/java/bin/java -cp /root/spark-1.5.1-bin-hadoop2.6/sbin/../conf/:/root/spark-1.5.1-bin-hadoop2.6/lib/spark-assembly-1.5.1-hadoop2.6.0.jar:/root/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/root/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/root/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar -Xms1g -Xmx1g org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://master.domain.com:7077
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/11/06 11:03:51 INFO Worker: Registered signal handlers for [TERM, HUP, INT]
15/11/06 11:03:51 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/11/06 11:03:51 INFO SecurityManager: Changing view acls to: root
15/11/06 11:03:51 INFO SecurityManager: Changing modify acls to: root
15/11/06 11:03:51 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
15/11/06 11:03:52 INFO Slf4jLogger: Slf4jLogger started
15/11/06 11:03:52 INFO Remoting: Starting remoting
15/11/06 11:03:52 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkWorker#10.80.70.38:50573]
15/11/06 11:03:52 INFO Utils: Successfully started service 'sparkWorker' on port 50573.
15/11/06 11:03:52 INFO Worker: Starting Spark worker 10.80.70.38:50573 with 8 cores, 6.7 GB RAM
15/11/06 11:03:52 INFO Worker: Running Spark version 1.5.1
15/11/06 11:03:52 INFO Worker: Spark home: /root/spark-1.5.1-bin-hadoop2.6
15/11/06 11:03:53 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
15/11/06 11:03:53 INFO WorkerWebUI: Started WorkerWebUI at http://10.80.70.38:8081
15/11/06 11:03:53 INFO Worker: Connecting to master master.domain.com:7077...
15/11/06 11:04:05 INFO Worker: Retrying connection to master (attempt # 1)
15/11/06 11:04:05 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[sparkWorker-akka.actor.default-dispatcher-4,5,main]
java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask#48711bf5 rejected from java.util.concurrent.ThreadPoolExecutor#14db705b[Running, pool size = 1, active threads = 0, queued tasks = 0, completed tasks = 1]
at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:112)
at org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1.apply(Worker.scala:211)
at org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1.apply(Worker.scala:210)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at org.apache.spark.deploy.worker.Worker.org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters(Worker.scala:210)
at org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$reregisterWithMaster$1.apply$mcV$sp(Worker.scala:288)
at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1119)
at org.apache.spark.deploy.worker.Worker.org$apache$spark$deploy$worker$Worker$$reregisterWithMaster(Worker.scala:234)
at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:521)
at org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$processMessage(AkkaRpcEnv.scala:177)
at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1$$anonfun$applyOrElse$4.apply$mcV$sp(AkkaRpcEnv.scala:126)
at org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$safelyCall(AkkaRpcEnv.scala:197)
at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1.applyOrElse(AkkaRpcEnv.scala:125)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59)
at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
at org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1.aroundReceive(AkkaRpcEnv.scala:92)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
15/11/06 11:04:05 INFO ShutdownHookManager: Shutdown hook called
start-slave spark://<master-IP>:7077 also FAILED as above.
start-slave spark://master:7077 WORKED and the worker shows in the master webUI
Spark Command: /root/java/bin/java -cp /root/spark-1.5.1-bin-hadoop2.6/sbin/../conf/:/root/spark-1.5.1-bin-hadoop2.6/lib/spark-assembly-1.5.1-hadoop2.6.0.jar:/root/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/root/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/root/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar -Xms1g -Xmx1g org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://master:7077
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/11/06 11:08:15 INFO Worker: Registered signal handlers for [TERM, HUP, INT]
15/11/06 11:08:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/11/06 11:08:15 INFO SecurityManager: Changing view acls to: root
15/11/06 11:08:15 INFO SecurityManager: Changing modify acls to: root
15/11/06 11:08:15 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
15/11/06 11:08:16 INFO Slf4jLogger: Slf4jLogger started
15/11/06 11:08:16 INFO Remoting: Starting remoting
15/11/06 11:08:17 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkWorker#10.80.70.38:40780]
15/11/06 11:08:17 INFO Utils: Successfully started service 'sparkWorker' on port 40780.
15/11/06 11:08:17 INFO Worker: Starting Spark worker 10.80.70.38:40780 with 8 cores, 6.7 GB RAM
15/11/06 11:08:17 INFO Worker: Running Spark version 1.5.1
15/11/06 11:08:17 INFO Worker: Spark home: /root/spark-1.5.1-bin-hadoop2.6
15/11/06 11:08:17 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
15/11/06 11:08:17 INFO WorkerWebUI: Started WorkerWebUI at http://10.80.70.38:8081
15/11/06 11:08:17 INFO Worker: Connecting to master master:7077...
15/11/06 11:08:17 INFO Worker: Successfully registered with master spark://master:7077
Note: I haven't added any extra config in conf/spark-env.sh
Note2: when looking in the master webUI, the spark master URL at the top is actually the one that worked for me, so I'd say in doubts just use that one.
I hope this helps ;)

Using hostname in /cong/slaves worked well for me.
Here are some steps I would take it,
Checked SSH public key
scp /etc/spark/conf.dist/spark-env.sh to your workers
My part of setting in spark-env.sh
export STANDALONE_SPARK_MASTER_HOST=hostname
export SPARK_MASTER_IP=$STANDALONE_SPARK_MASTER_HOST

I guess you missed something in your configuration, that's what I learned from your log.
Check your /etc/hosts, make sure ubuntu1 in your master's host list and its Ip is match the slave's IP address.
Add export SPARK_LOCAL_IP='ubuntu1' in the spark-env.sh file of your slave

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string