Spark-Submit cannot connect to ResourceManager - apache-spark

I'm trying to run an example Spark job (Hadoop 2.7, Spark 3.3.1) on my Hadoop cluster, which consists of namenode and datanode0.
After running start-dfs.sh, I can see the datanode in the UI, and running jps on the datanode shows a "DataNode" process.
When I run spark-submit with the example, I get the following output:
spark@namenode:~$ spark-submit --deploy-mode client --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.12-3.3.1.jar 10
23/02/07 21:23:08 INFO SparkContext: Running Spark version 3.3.1
23/02/07 21:23:08 INFO ResourceUtils: ==============================================================
23/02/07 21:23:08 INFO ResourceUtils: No custom resources configured for spark.driver.
23/02/07 21:23:08 INFO ResourceUtils: ==============================================================
23/02/07 21:23:08 INFO SparkContext: Submitted application: Spark Pi
23/02/07 21:23:09 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 512, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
23/02/07 21:23:09 INFO ResourceProfile: Limiting resource is cpus at 1 tasks per executor
23/02/07 21:23:09 INFO ResourceProfileManager: Added ResourceProfile id: 0
23/02/07 21:23:09 INFO SecurityManager: Changing view acls to: spark
23/02/07 21:23:09 INFO SecurityManager: Changing modify acls to: spark
23/02/07 21:23:09 INFO SecurityManager: Changing view acls groups to:
23/02/07 21:23:09 INFO SecurityManager: Changing modify acls groups to:
23/02/07 21:23:09 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark); groups with view permissions: Set(); users with modify permissions: Set(spark); groups with modify permissions: Set()
23/02/07 21:23:09 INFO Utils: Successfully started service 'sparkDriver' on port 45161.
23/02/07 21:23:09 INFO SparkEnv: Registering MapOutputTracker
23/02/07 21:23:09 INFO SparkEnv: Registering BlockManagerMaster
23/02/07 21:23:09 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
23/02/07 21:23:09 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
23/02/07 21:23:09 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
23/02/07 21:23:09 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-09c3077f-8d16-4496-9808-0626cefe1cc7
23/02/07 21:23:09 INFO MemoryStore: MemoryStore started with capacity 93.3 MiB
23/02/07 21:23:09 INFO SparkEnv: Registering OutputCommitCoordinator
23/02/07 21:23:10 INFO Utils: Successfully started service 'SparkUI' on port 4040.
23/02/07 21:23:10 INFO SparkContext: Added JAR file:/home/spark/spark/examples/jars/spark-examples_2.12-3.3.1.jar at spark://namenode:45161/jars/spark-examples_2.12-3.3.1.jar with timestamp 1675801388738
23/02/07 21:23:10 INFO RMProxy: Connecting to ResourceManager at namenode/192.168.1.17:8032
23/02/07 21:23:11 INFO Client: Retrying connect to server: namenode/192.168.1.17:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
23/02/07 21:23:12 INFO Client: Retrying connect to server: namenode/192.168.1.17:8032. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
23/02/07 21:23:13 INFO Client: Retrying connect to server: namenode/192.168.1.17:8032. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
23/02/07 21:23:14 INFO Client: Retrying connect to server: namenode/192.168.1.17:8032. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
23/02/07 21:23:15 INFO Client: Retrying connect to server: namenode/192.168.1.17:8032. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
23/02/07 21:23:16 INFO Client: Retrying connect to server: namenode/192.168.1.17:8032. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
^C23/02/07 21:23:17 INFO DiskBlockManager: Shutdown hook called
23/02/07 21:23:17 INFO ShutdownHookManager: Shutdown hook called
23/02/07 21:23:17 INFO ShutdownHookManager: Deleting directory /tmp/spark-5a15bfc6-654d-4332-be86-3f20b4cf40f1/userFiles-049088cd-d541-4424-930b-dcaa18634860
23/02/07 21:23:17 INFO ShutdownHookManager: Deleting directory /tmp/spark-4d529cc6-0bc8-43a9-b2af-3671ee11f963
23/02/07 21:23:17 INFO ShutdownHookManager: Deleting directory /tmp/spark-5a15bfc6-654d-4332-be86-3f20b4cf40f1
Is it possible that I've messed up the YARN configuration?
Here's my /etc/hosts:
192.168.1.17 namenode
192.168.1.23 datanode0
This is my yarn-site.xml:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>namenode</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>${yarn.resourcemanager.hostname}:8030</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>${yarn.resourcemanager.hostname}:8032</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>${yarn.resourcemanager.hostname}:8088</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>${yarn.resourcemanager.hostname}:8031</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>${yarn.resourcemanager.hostname}:8088</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>1536</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>1536</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>128</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
</configuration>
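One thing that can be checked mechanically in a yarn-site.xml like the one above is whether two ResourceManager addresses resolve to the same host and port. The following is a minimal Python sketch under stated assumptions: the `SAMPLE` string is a trimmed copy of the entries above, the helper names (`load_properties`, `expand`, `shared_endpoints`) are made up for illustration, and `expand` only handles simple `${property}` references from the same file, which is a simplification of Hadoop's real variable expansion. It says nothing about whether a ResourceManager is actually listening on port 8032, which is what the retry messages in the log are about.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Trimmed copy of the relevant entries from the yarn-site.xml above.
SAMPLE = """<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>namenode</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>${yarn.resourcemanager.hostname}:8032</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>${yarn.resourcemanager.hostname}:8088</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>${yarn.resourcemanager.hostname}:8088</value>
</property>
</configuration>"""


def load_properties(xml_text):
    """Return {name: value} for every <property> in a Hadoop-style config."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None and value is not None:
            props[name.strip()] = value.strip()
    return props


def expand(value, props):
    """Expand ${other.property} references using values from the same file.

    Unknown references are left untouched. Single pass only (assumption).
    """
    return re.sub(
        r"\$\{([^}]+)\}",
        lambda m: props.get(m.group(1), m.group(0)),
        value,
    )


def shared_endpoints(props):
    """Map each host:port used by more than one *.address property to those names."""
    by_endpoint = defaultdict(list)
    for name, value in props.items():
        if name.endswith(".address"):
            by_endpoint[expand(value, props)].append(name)
    return {ep: names for ep, names in by_endpoint.items() if len(names) > 1}


if __name__ == "__main__":
    for endpoint, names in shared_endpoints(load_properties(SAMPLE)).items():
        print(endpoint, "is used by", ", ".join(names))
```

On the sample, this reports that yarn.resourcemanager.webapp.address and yarn.resourcemanager.admin.address both resolve to namenode:8088.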

Related

Error while running spark client mode on mesos using docker

We have a 3-node Mesos cluster. The master service was started on machine 1 using the command below:
sudo ./bin/mesos-master.sh --ip=machine1-ip --work_dir=/home/mapr/mesos/mesos-1.7.0/build/workDir --zk=zk://machine1-ip:2181/mesos --quorum=1
and the agent services on the other 2 machines using the commands below:
sudo ./bin/mesos-agent.sh --containerizers=docker --master=zk://machine1-ip:2181/mesos --work_dir=/home/mapr/mesos/mesos-1.7.0/build/workDir --ip=machine2-ip --no-systemd_enable_support
sudo ./bin/mesos-agent.sh --containerizers=docker --master=zk://machine1-ip:2181/mesos --work_dir=/home/mapr/mesos/mesos-1.7.0/build/workDir --ip=machine3-ip --no-systemd_enable_support
The following property was set on machine1:
export MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.so
We are trying to run a Spark job using a Docker image.
Note that we did not set "SPARK_EXECUTOR_URI" on machine1 because, as we understand it, the executor runs inside the Docker container and not on the slave machine, so this property should not be required.
The spark-submit command (run from machine 1) is:
/home/mapr/newSpark/spark-2.4.0-bin-hadoop2.7/bin/spark-submit \
--master mesos://machine1:5050 \
--deploy-mode client \
--class com.learning.spark.WordCount \
--conf spark.mesos.executor.docker.image=mesosphere/spark:2.4.0-2.2.1-3-hadoop-2.7 \
/home/mapr/mesos/wordcount.jar hdfs://machine2:8020/hdfslocation/input.txt hdfs://machine2:8020/hdfslocation/output
We get the error below on spark-submit:
Mesos task log:
I1211 20:27:55.040856 5996 exec.cpp:162] Version: 1.7.0
I1211 20:27:55.064775 6016 exec.cpp:236] Executor registered on agent 44c2e848-cd06-4546-b0e9-15537084df1b-S1
I1211 20:27:55.068828 6018 executor.cpp:130] Registered docker executor on company-i0058.company.co.in
I1211 20:27:55.069756 6016 executor.cpp:186] Starting task 3
/bin/sh: 1: /home/mapr/newSpark/spark-2.4.0-bin-hadoop2.7/./bin/spark-class: not found
I1211 20:27:57.669881 6017 executor.cpp:736] Container exited with status 127
I1211 20:27:58.672829 6019 process.cpp:926] Stopped the socket accept loop
Messages on the terminal:
2018-12-11 20:27:49 INFO SparkContext:54 - Running Spark version 2.4.0
2018-12-11 20:27:49 INFO SparkContext:54 - Submitted application: WordCount
2018-12-11 20:27:49 INFO SecurityManager:54 - Changing view acls to: mapr
2018-12-11 20:27:49 INFO SecurityManager:54 - Changing modify acls to: mapr
2018-12-11 20:27:49 INFO SecurityManager:54 - Changing view acls groups to:
2018-12-11 20:27:49 INFO SecurityManager:54 - Changing modify acls groups to:
2018-12-11 20:27:49 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mapr); groups with view permissions: Set(); users with modify permissions: Set(mapr); groups with modify permissions: Set()
2018-12-11 20:27:49 INFO Utils:54 - Successfully started service 'sparkDriver' on port 48069.
2018-12-11 20:27:49 INFO SparkEnv:54 - Registering MapOutputTracker
2018-12-11 20:27:49 INFO SparkEnv:54 - Registering BlockManagerMaster
2018-12-11 20:27:49 INFO BlockManagerMasterEndpoint:54 - Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
2018-12-11 20:27:49 INFO BlockManagerMasterEndpoint:54 - BlockManagerMasterEndpoint up
2018-12-11 20:27:49 INFO DiskBlockManager:54 - Created local directory at /tmp/blockmgr-3a4afff7-b050-45ba-bb50-c9f4ec5cc031
2018-12-11 20:27:49 INFO MemoryStore:54 - MemoryStore started with capacity 366.3 MB
2018-12-11 20:27:49 INFO SparkEnv:54 - Registering OutputCommitCoordinator
2018-12-11 20:27:49 INFO log:192 - Logging initialized @3157ms
2018-12-11 20:27:50 INFO Server:351 - jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
2018-12-11 20:27:50 INFO Server:419 - Started @3273ms
2018-12-11 20:27:50 INFO AbstractConnector:278 - Started ServerConnector@1cfd1875{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2018-12-11 20:27:50 INFO Utils:54 - Successfully started service 'SparkUI' on port 4040.
2018-12-11 20:27:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6f0628de{/jobs,null,AVAILABLE,@Spark}
2018-12-11 20:27:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@2b27cc70{/jobs/json,null,AVAILABLE,@Spark}
2018-12-11 20:27:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6f6a7463{/jobs/job,null,AVAILABLE,@Spark}
2018-12-11 20:27:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@79f227a9{/jobs/job/json,null,AVAILABLE,@Spark}
2018-12-11 20:27:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6ca320ab{/stages,null,AVAILABLE,@Spark}
2018-12-11 20:27:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@50d68830{/stages/json,null,AVAILABLE,@Spark}
2018-12-11 20:27:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1e53135d{/stages/stage,null,AVAILABLE,@Spark}
2018-12-11 20:27:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6754ef00{/stages/stage/json,null,AVAILABLE,@Spark}
2018-12-11 20:27:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@619bd14c{/stages/pool,null,AVAILABLE,@Spark}
2018-12-11 20:27:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@323e8306{/stages/pool/json,null,AVAILABLE,@Spark}
2018-12-11 20:27:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@a23a01d{/storage,null,AVAILABLE,@Spark}
2018-12-11 20:27:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4acf72b6{/storage/json,null,AVAILABLE,@Spark}
2018-12-11 20:27:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7561db12{/storage/rdd,null,AVAILABLE,@Spark}
2018-12-11 20:27:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3301500b{/storage/rdd/json,null,AVAILABLE,@Spark}
2018-12-11 20:27:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@24b52d3e{/environment,null,AVAILABLE,@Spark}
2018-12-11 20:27:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@15deb1dc{/environment/json,null,AVAILABLE,@Spark}
2018-12-11 20:27:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6e9c413e{/executors,null,AVAILABLE,@Spark}
2018-12-11 20:27:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@57a4d5ee{/executors/json,null,AVAILABLE,@Spark}
2018-12-11 20:27:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@5af5def9{/executors/threadDump,null,AVAILABLE,@Spark}
2018-12-11 20:27:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3a45c42a{/executors/threadDump/json,null,AVAILABLE,@Spark}
2018-12-11 20:27:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@36dce7ed{/static,null,AVAILABLE,@Spark}
2018-12-11 20:27:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4b770e40{/,null,AVAILABLE,@Spark}
2018-12-11 20:27:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@78e16155{/api,null,AVAILABLE,@Spark}
2018-12-11 20:27:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@19868320{/jobs/job/kill,null,AVAILABLE,@Spark}
2018-12-11 20:27:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@50b0bc4c{/stages/stage/kill,null,AVAILABLE,@Spark}
2018-12-11 20:27:50 INFO SparkUI:54 - Bound SparkUI to 0.0.0.0, and started at http://machine1:4040
2018-12-11 20:27:50 INFO SparkContext:54 - Added JAR file:/home/mapr/mesos/wordcount.jar at spark://machine1:48069/jars/wordcount.jar with timestamp 1544540270193
I1211 20:27:50.557170 7462 sched.cpp:232] Version: 1.7.0
I1211 20:27:50.560644 7454 sched.cpp:336] New master detected at master@machine1:5050
I1211 20:27:50.561132 7454 sched.cpp:356] No credentials provided. Attempting to register without authentication
I1211 20:27:50.571651 7456 sched.cpp:744] Framework registered with 5260e4c8-de1c-4772-b5a7-340480594ef4-0000
2018-12-11 20:27:50 INFO Utils:54 - Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 56351.
2018-12-11 20:27:50 INFO NettyBlockTransferService:54 - Server created on machine1:56351
2018-12-11 20:27:50 INFO BlockManager:54 - Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
2018-12-11 20:27:50 INFO BlockManagerMaster:54 - Registering BlockManager BlockManagerId(driver, impetus-i0053.impetus.co.in, 56351, None)
2018-12-11 20:27:50 INFO BlockManagerMasterEndpoint:54 - Registering block manager machine1:56351 with 366.3 MB RAM, BlockManagerId(driver, impetus-i0053.impetus.co.in, 56351, None)
2018-12-11 20:27:50 INFO BlockManagerMaster:54 - Registered BlockManager BlockManagerId(driver, machine1, 56351, None)
2018-12-11 20:27:50 INFO BlockManager:54 - Initialized BlockManager: BlockManagerId(driver, machine1, 56351, None)
2018-12-11 20:27:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@73ba6fe6{/metrics/json,null,AVAILABLE,@Spark}
2018-12-11 20:27:50 INFO MesosCoarseGrainedSchedulerBackend:54 - SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
2018-12-11 20:27:51 INFO MesosCoarseGrainedSchedulerBackend:54 - Mesos task 0 is now TASK_STARTING
2018-12-11 20:27:51 INFO MesosCoarseGrainedSchedulerBackend:54 - Mesos task 1 is now TASK_STARTING
2018-12-11 20:27:51 INFO MemoryStore:54 - Block broadcast_0 stored as values in memory (estimated size 288.1 KB, free 366.0 MB)
2018-12-11 20:27:51 INFO MemoryStore:54 - Block broadcast_0_piece0 stored as bytes in memory (estimated size 25.1 KB, free 366.0 MB)
2018-12-11 20:27:51 INFO BlockManagerInfo:54 - Added broadcast_0_piece0 in memory on machine1:56351 (size: 25.1 KB, free: 366.3 MB)
2018-12-11 20:27:51 INFO SparkContext:54 - Created broadcast 0 from textFile at WordCount.scala:22
2018-12-11 20:27:52 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-12-11 20:27:52 INFO FileInputFormat:249 - Total input paths to process : 1
2018-12-11 20:27:53 INFO deprecation:1173 - mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
2018-12-11 20:27:53 INFO HadoopMapRedCommitProtocol:54 - Using output committer class org.apache.hadoop.mapred.FileOutputCommitter
2018-12-11 20:27:53 INFO FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2018-12-11 20:27:53 INFO SparkContext:54 - Starting job: runJob at SparkHadoopWriter.scala:78
2018-12-11 20:27:53 INFO DAGScheduler:54 - Registering RDD 3 (map at WordCount.scala:24)
2018-12-11 20:27:53 INFO DAGScheduler:54 - Got job 0 (runJob at SparkHadoopWriter.scala:78) with 2 output partitions
2018-12-11 20:27:53 INFO DAGScheduler:54 - Final stage: ResultStage 1 (runJob at SparkHadoopWriter.scala:78)
2018-12-11 20:27:53 INFO DAGScheduler:54 - Parents of final stage: List(ShuffleMapStage 0)
2018-12-11 20:27:53 INFO DAGScheduler:54 - Missing parents: List(ShuffleMapStage 0)
2018-12-11 20:27:53 INFO DAGScheduler:54 - Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at map at WordCount.scala:24), which has no missing parents
2018-12-11 20:27:53 INFO MesosCoarseGrainedSchedulerBackend:54 - Mesos task 1 is now TASK_RUNNING
2018-12-11 20:27:53 INFO MemoryStore:54 - Block broadcast_1 stored as values in memory (estimated size 5.0 KB, free 366.0 MB)
2018-12-11 20:27:53 INFO MemoryStore:54 - Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.9 KB, free 366.0 MB)
2018-12-11 20:27:53 INFO BlockManagerInfo:54 - Added broadcast_1_piece0 in memory on machine1:56351 (size: 2.9 KB, free: 366.3 MB)
2018-12-11 20:27:53 INFO SparkContext:54 - Created broadcast 1 from broadcast at DAGScheduler.scala:1161
2018-12-11 20:27:53 INFO DAGScheduler:54 - Submitting 2 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[3] at map at WordCount.scala:24) (first 15 tasks are for partitions Vector(0, 1))
2018-12-11 20:27:53 INFO TaskSchedulerImpl:54 - Adding task set 0.0 with 2 tasks
2018-12-11 20:27:53 INFO MesosCoarseGrainedSchedulerBackend:54 - Mesos task 0 is now TASK_RUNNING
2018-12-11 20:27:54 INFO MesosCoarseGrainedSchedulerBackend:54 - Mesos task 0 is now TASK_FAILED
2018-12-11 20:27:54 INFO BlockManagerMaster:54 - Removal of executor 0 requested
2018-12-11 20:27:54 INFO CoarseGrainedSchedulerBackend$DriverEndpoint:54 - Asked to remove non-existent executor 0
2018-12-11 20:27:54 INFO BlockManagerMasterEndpoint:54 - Trying to remove executor 0 from BlockManagerMaster.
2018-12-11 20:27:54 INFO MesosCoarseGrainedSchedulerBackend:54 - Mesos task 1 is now TASK_FAILED
2018-12-11 20:27:54 INFO BlockManagerMasterEndpoint:54 - Trying to remove executor 1 from BlockManagerMaster.
2018-12-11 20:27:54 INFO BlockManagerMaster:54 - Removal of executor 1 requested
2018-12-11 20:27:54 INFO CoarseGrainedSchedulerBackend$DriverEndpoint:54 - Asked to remove non-existent executor 1
2018-12-11 20:27:54 INFO MesosCoarseGrainedSchedulerBackend:54 - Mesos task 2 is now TASK_STARTING
2018-12-11 20:27:55 INFO MesosCoarseGrainedSchedulerBackend:54 - Mesos task 3 is now TASK_STARTING
2018-12-11 20:27:57 INFO MesosCoarseGrainedSchedulerBackend:54 - Mesos task 2 is now TASK_RUNNING
2018-12-11 20:27:57 INFO MesosCoarseGrainedSchedulerBackend:54 - Mesos task 2 is now TASK_FAILED
2018-12-11 20:27:57 INFO MesosCoarseGrainedSchedulerBackend:54 - Blacklisting Mesos slave b92da3e9-a9c4-422a-babe-c5fb0f33e027-S0 due to too many failures; is Spark installed on it?
2018-12-11 20:27:57 INFO BlockManagerMaster:54 - Removal of executor 2 requested
2018-12-11 20:27:57 INFO CoarseGrainedSchedulerBackend$DriverEndpoint:54 - Asked to remove non-existent executor 2
2018-12-11 20:27:57 INFO BlockManagerMasterEndpoint:54 - Trying to remove executor 2 from BlockManagerMaster.
2018-12-11 20:27:57 INFO MesosCoarseGrainedSchedulerBackend:54 - Mesos task 3 is now TASK_RUNNING
2018-12-11 20:27:57 INFO MesosCoarseGrainedSchedulerBackend:54 - Mesos task 3 is now TASK_FAILED
2018-12-11 20:27:57 INFO MesosCoarseGrainedSchedulerBackend:54 - Blacklisting Mesos slave 44c2e848-cd06-4546-b0e9-15537084df1b-S1 due to too many failures; is Spark installed on it?
2018-12-11 20:27:57 INFO BlockManagerMaster:54 - Removal of executor 3 requested
2018-12-11 20:27:57 INFO BlockManagerMasterEndpoint:54 - Trying to remove executor 3 from BlockManagerMaster.
2018-12-11 20:27:57 INFO CoarseGrainedSchedulerBackend$DriverEndpoint:54 - Asked to remove non-existent executor 3
2018-12-11 20:28:08 WARN TaskSchedulerImpl:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
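The Mesos task log above already names the immediate failure: the shell inside the container could not find spark-class at the driver host's Spark path, and exit status 127 is the shell's conventional "command not found" code. As a rough sketch (the `summarize` helper and the regular expressions are illustrative, not part of Mesos or Spark), the Python below pulls those two facts out of a task log. Spark's Mesos documentation describes a `spark.mesos.executor.home` setting for pointing executors at a different Spark directory; whether it resolves this particular setup is not confirmed by anything in the question.

```python
import re

# Three lines copied from the Mesos task log above.
TASK_LOG = """\
I1211 20:27:55.069756 6016 executor.cpp:186] Starting task 3
/bin/sh: 1: /home/mapr/newSpark/spark-2.4.0-bin-hadoop2.7/./bin/spark-class: not found
I1211 20:27:57.669881 6017 executor.cpp:736] Container exited with status 127
"""

# Shell message of the form "/bin/sh: <n>: <path>: not found".
NOT_FOUND = re.compile(r"^/bin/sh: \d+: (?P<path>\S+): not found$")
# Docker executor message reporting the container's exit status.
EXIT_STATUS = re.compile(r"Container exited with status (?P<code>\d+)")


def summarize(log_text):
    """Return (missing_paths, last_exit_status) found in a Mesos task log."""
    missing = []
    status = None
    for line in log_text.splitlines():
        found = NOT_FOUND.match(line)
        if found:
            missing.append(found.group("path"))
        exited = EXIT_STATUS.search(line)
        if exited:
            status = int(exited.group("code"))
    return missing, status


if __name__ == "__main__":
    paths, code = summarize(TASK_LOG)
    print("exit status:", code)
    for path in paths:
        print("not found inside the container:", path)
```

The missing path is the Spark home of machine1, which suggests the executor command was built from the driver's directory layout rather than the layout inside the Docker image.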

Apache Spark: worker can't connect to master but can ping and ssh from worker to master

I'm trying to set up an 8-node cluster on 8 RHEL 7.3 x86 machines using Spark 2.0.1. start-master.sh goes through fine:
Spark Command: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.102-4.b14.el7.x86_64/jre/bin/java -cp /usr/local/bin/spark-2.0.1-bin-hadoop2.7/conf/:/usr/local/bin/spark-2.0.1-bin-hadoop2.7/jars/* -Xmx1g org.apache.spark.deploy.master.Master --host lambda.foo.net --port 7077 --webui-port 8080
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/12/08 04:26:46 INFO Master: Started daemon with process name: 22181@lambda.foo.net
16/12/08 04:26:46 INFO SignalUtils: Registered signal handler for TERM
16/12/08 04:26:46 INFO SignalUtils: Registered signal handler for HUP
16/12/08 04:26:46 INFO SignalUtils: Registered signal handler for INT
16/12/08 04:26:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/12/08 04:26:46 INFO SecurityManager: Changing view acls to: root
16/12/08 04:26:46 INFO SecurityManager: Changing modify acls to: root
16/12/08 04:26:46 INFO SecurityManager: Changing view acls groups to:
16/12/08 04:26:46 INFO SecurityManager: Changing modify acls groups to:
16/12/08 04:26:46 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
16/12/08 04:26:46 INFO Utils: Successfully started service 'sparkMaster' on port 7077.
16/12/08 04:26:46 INFO Master: Starting Spark master at spark://lambda.foo.net:7077
16/12/08 04:26:46 INFO Master: Running Spark version 2.0.1
16/12/08 04:26:46 INFO Utils: Successfully started service 'MasterUI' on port 8080.
16/12/08 04:26:46 INFO MasterWebUI: Bound MasterWebUI to 0.0.0.0, and started at http://19.341.11.212:8080
16/12/08 04:26:46 INFO Utils: Successfully started service on port 6066.
16/12/08 04:26:46 INFO StandaloneRestServer: Started REST server for submitting applications on port 6066
16/12/08 04:26:46 INFO Master: I have been elected leader! New state: ALIVE
But when I try to bring up the workers, using start-slaves.sh, what I see in the log of the workers is:
Spark Command: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.102-4.b14.el7.x86_64/jre/bin/java -cp /usr/local/bin/spark-2.0.1-bin-hadoop2.7/conf/:/usr/local/bin/spark-2.0.1-bin-hadoop2.7/jars/* -Xmx1g org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://lambda.foo.net:7077
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/12/08 04:30:00 INFO Worker: Started daemon with process name: 14649@hawk040os4.foo.net
16/12/08 04:30:00 INFO SignalUtils: Registered signal handler for TERM
16/12/08 04:30:00 INFO SignalUtils: Registered signal handler for HUP
16/12/08 04:30:00 INFO SignalUtils: Registered signal handler for INT
16/12/08 04:30:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/12/08 04:30:00 INFO SecurityManager: Changing view acls to: root
16/12/08 04:30:00 INFO SecurityManager: Changing modify acls to: root
16/12/08 04:30:00 INFO SecurityManager: Changing view acls groups to:
16/12/08 04:30:00 INFO SecurityManager: Changing modify acls groups to:
16/12/08 04:30:00 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
16/12/08 04:30:00 INFO Utils: Successfully started service 'sparkWorker' on port 35858.
16/12/08 04:30:00 INFO Worker: Starting Spark worker 15.242.22.179:35858 with 24 cores, 1510.2 GB RAM
16/12/08 04:30:00 INFO Worker: Running Spark version 2.0.1
16/12/08 04:30:00 INFO Worker: Spark home: /usr/local/bin/spark-2.0.1-bin-hadoop2.7
16/12/08 04:30:00 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
16/12/08 04:30:00 INFO WorkerWebUI: Bound WorkerWebUI to 0.0.0.0, and started at http://15.242.22.179:8081
16/12/08 04:30:00 INFO Worker: Connecting to master lambda.foo.net:7077...
16/12/08 04:30:00 WARN Worker: Failed to connect to master lambda.foo.net:7077
org.apache.spark.SparkException: Exception thrown in awaitResult
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:88)
at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:96)
at org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1$$anon$1.run(Worker.scala:216)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to connect to lambda.foo.net/19.341.11.212:7077
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:197)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:191)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:187)
... 4 more
Caused by: java.net.NoRouteToHostException: No route to host: lambda.foo.net/19.341.11.212:7077
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
... 1 more
16/12/08 04:30:12 INFO Worker: Retrying connection to master (attempt # 1)
16/12/08 04:30:12 INFO Worker: Connecting to master lambda.foo.net:7077...
16/12/08 04:30:12 WARN Worker: Failed to connect to master lambda.foo.net:7077
org.apache.spark.SparkException: Exception thrown in awaitResult
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
So it says "No route to host", but I can successfully ping the master from the worker node, and I can also ssh from the worker to the master node.
Why does Spark say "No route to host"?
Problem solved: the firewall was blocking the packets.
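This fits the symptoms: ping uses ICMP and ssh uses TCP port 22, so neither shows that TCP port 7077 is open, and a firewall that rejects packets can surface in Java as "No route to host". A minimal way to test the exact port from the worker is a plain TCP connect. The Python sketch below is illustrative only; the function name `can_connect` and the 3-second default timeout are arbitrary choices.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Try a TCP connection to host:port.

    Returns (True, None) on success, or (False, reason) where reason
    names the OS-level error (refused, timed out, no route, DNS failure).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, None
    except OSError as exc:
        return False, f"{type(exc).__name__}: {exc}"
```

Run on the worker, `can_connect("lambda.foo.net", 7077)` exercises the same host and port the worker log reports, while `can_connect("lambda.foo.net", 22)` would be expected to succeed if ssh works.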

Cannot run spark v2.0.0 example on cluster

So I have set up a Spark cluster. But I can't actually get it to work. When I submit the SparkPi example with:
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master spark://x.y.129.163:7077 \
--deploy-mode cluster \
--supervise \
--executor-memory 20G \
--total-executor-cores 2 \
examples/jars/spark-examples_2.11-2.0.0.jar 1000
I get the following from the worker logs:
Spark Command: /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.111-2.6.7.2.el7_2.x86_64/jre/bin/java -cp /opt/spark/spark-2.0.0-bin-hadoop2.7/conf/:/opt/spark/spark-2.0.0-bin-hadoop2.7/jars/* -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://mesos-master:7077
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/09/18 09:20:56 INFO Worker: Started daemon with process name: 23949@mesos-slave-4.novalocal
16/09/18 09:20:56 INFO SignalUtils: Registered signal handler for TERM
16/09/18 09:20:56 INFO SignalUtils: Registered signal handler for HUP
16/09/18 09:20:56 INFO SignalUtils: Registered signal handler for INT
16/09/18 09:20:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/09/18 09:20:56 INFO SecurityManager: Changing view acls to: root
16/09/18 09:20:56 INFO SecurityManager: Changing modify acls to: root
16/09/18 09:20:56 INFO SecurityManager: Changing view acls groups to:
16/09/18 09:20:56 INFO SecurityManager: Changing modify acls groups to:
16/09/18 09:20:56 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
16/09/18 09:21:00 WARN ThreadLocalRandom: Failed to generate a seed from SecureRandom within 3 seconds. Not enough entrophy?
16/09/18 09:21:00 INFO Utils: Successfully started service 'sparkWorker' on port 55256.
16/09/18 09:21:00 INFO Worker: Starting Spark worker x.y.129.162:55256 with 4 cores, 6.6 GB RAM
16/09/18 09:21:00 INFO Worker: Running Spark version 2.0.0
16/09/18 09:21:00 INFO Worker: Spark home: /opt/spark/spark-2.0.0-bin-hadoop2.7
16/09/18 09:21:00 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
16/09/18 09:21:00 INFO WorkerWebUI: Bound WorkerWebUI to x.y.129.162, and started at http://x.y.129.162:8081
16/09/18 09:21:00 INFO Worker: Connecting to master mesos-master:7077...
16/09/18 09:21:00 INFO TransportClientFactory: Successfully created connection to mesos-master/x.y.129.163:7077 after 33 ms (0 ms spent in bootstraps)
16/09/18 09:21:00 INFO Worker: Successfully registered with master spark://x.y.129.163:7077
16/09/18 09:21:00 INFO Worker: Asked to launch driver driver-20160918090435-0001
16/09/18 09:21:01 INFO DriverRunner: Launch Command: "/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.111-2.6.7.2.el7_2.x86_64/jre/bin/java" "-cp" "/opt/spark/spark-2.0.0-bin-hadoop2.7/conf/:/opt/spark/spark-2.0.0-bin-hadoop2.7/jars/*" "-Xmx1024M" "-Dspark.executor.memory=20G" "-Dspark.submit.deployMode=cluster" "-Dspark.app.name=org.apache.spark.examples.SparkPi" "-Dspark.cores.max=2" "-Dspark.rpc.askTimeout=10" "-Dspark.driver.supervise=true" "-Dspark.jars=file:/opt/spark/spark-2.0.0-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.0.0.jar" "-Dspark.master=spark://x.y.129.163:7077" "-XX:MaxPermSize=256m" "org.apache.spark.deploy.worker.DriverWrapper" "spark://Worker@x.y.129.162:55256" "/opt/spark/spark-2.0.0-bin-hadoop2.7/work/driver-20160918090435-0001/spark-examples_2.11-2.0.0.jar" "org.apache.spark.examples.SparkPi" "1000"
16/09/18 09:21:06 INFO DriverRunner: Command exited with status 1, re-launching after 1 s.
16/09/18 09:21:07 INFO DriverRunner: Launch Command: "/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.111-2.6.7.2.el7_2.x86_64/jre/bin/java" "-cp" "/opt/spark/spark-2.0.0-bin-hadoop2.7/conf/:/opt/spark/spark-2.0.0-bin-hadoop2.7/jars/*" "-Xmx1024M" "-Dspark.executor.memory=20G" "-Dspark.submit.deployMode=cluster" "-Dspark.app.name=org.apache.spark.examples.SparkPi" "-Dspark.cores.max=2" "-Dspark.rpc.askTimeout=10" "-Dspark.driver.supervise=true" "-Dspark.jars=file:/opt/spark/spark-2.0.0-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.0.0.jar" "-Dspark.master=spark://x.y.129.163:7077" "-XX:MaxPermSize=256m" "org.apache.spark.deploy.worker.DriverWrapper" "spark://Worker@x.y.129.162:55256" "/opt/spark/spark-2.0.0-bin-hadoop2.7/work/driver-20160918090435-0001/spark-examples_2.11-2.0.0.jar" "org.apache.spark.examples.SparkPi" "1000"
16/09/18 09:21:12 INFO DriverRunner: Command exited with status 1, re-launching after 1 s.
i.e. the job/driver appears to be failing and then retrying indefinitely.
When I look at the driver's stderr on the worker node, I see:
Launch Command: "/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.111-2.6.7.2.el7_2.x86_64/jre/bin/java" "-cp" "/opt/spark/spark-2.0.0-bin-hadoop2.7/conf/:/opt/spark/spark-2.0.0-bin-hadoop2.7/jars/*" "-Xmx1024M" "-Dspark.executor.memory=20G" "-Dspark.submit.deployMode=cluster" "-Dspark.app.name=org.apache.spark.examples.SparkPi" "-Dspark.cores.max=2" "-Dspark.rpc.askTimeout=10" "-Dspark.driver.supervise=true" "-Dspark.jars=file:/opt/spark/spark-2.0.0-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.0.0.jar" "-Dspark.master=spark://x.y.129.163:7077" "-XX:MaxPermSize=256m" "org.apache.spark.deploy.worker.DriverWrapper" "spark://Worker@x.y.129.162:33364" "/opt/spark/spark-2.0.0-bin-hadoop2.7/work/driver-20160918090435-0001/spark-examples_2.11-2.0.0.jar" "org.apache.spark.examples.SparkPi" "1000"
========================================
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/09/18 09:13:18 INFO SecurityManager: Changing view acls to: root
16/09/18 09:13:18 INFO SecurityManager: Changing modify acls to: root
16/09/18 09:13:18 INFO SecurityManager: Changing view acls groups to:
16/09/18 09:13:18 INFO SecurityManager: Changing modify acls groups to:
16/09/18 09:13:18 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
16/09/18 09:13:21 WARN ThreadLocalRandom: Failed to generate a seed from SecureRandom within 3 seconds. Not enough entrophy?
16/09/18 09:13:22 WARN Utils: Service 'Driver' could not bind on port 0. Attempting port 1.
16/09/18 09:13:22 WARN Utils: Service 'Driver' could not bind on port 0. Attempting port 1.
16/09/18 09:13:22 WARN Utils: Service 'Driver' could not bind on port 0. Attempting port 1.
16/09/18 09:13:22 WARN Utils: Service 'Driver' could not bind on port 0. Attempting port 1.
16/09/18 09:13:22 WARN Utils: Service 'Driver' could not bind on port 0. Attempting port 1.
16/09/18 09:13:22 WARN Utils: Service 'Driver' could not bind on port 0. Attempting port 1.
16/09/18 09:13:22 WARN Utils: Service 'Driver' could not bind on port 0. Attempting port 1.
16/09/18 09:13:22 WARN Utils: Service 'Driver' could not bind on port 0. Attempting port 1.
16/09/18 09:13:22 WARN Utils: Service 'Driver' could not bind on port 0. Attempting port 1.
16/09/18 09:13:22 WARN Utils: Service 'Driver' could not bind on port 0. Attempting port 1.
16/09/18 09:13:22 WARN Utils: Service 'Driver' could not bind on port 0. Attempting port 1.
16/09/18 09:13:22 WARN Utils: Service 'Driver' could not bind on port 0. Attempting port 1.
16/09/18 09:13:22 WARN Utils: Service 'Driver' could not bind on port 0. Attempting port 1.
16/09/18 09:13:22 WARN Utils: Service 'Driver' could not bind on port 0. Attempting port 1.
16/09/18 09:13:22 WARN Utils: Service 'Driver' could not bind on port 0. Attempting port 1.
16/09/18 09:13:22 WARN Utils: Service 'Driver' could not bind on port 0. Attempting port 1.
Exception in thread "main" java.net.BindException: Cannot assign requested address: Service 'Driver' failed after 16 retries! Consider explicitly setting the appropriate port for the service 'Driver' (for example spark.ui.port for SparkUI) to an available port or increasing spark.port.maxRetries.
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:463)
at sun.nio.ch.Net.bind(Net.java:455)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:125)
at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:485)
at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1089)
at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:430)
at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:415)
at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:903)
at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:198)
at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:348)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
(excuse the timestamps, the logs are the same later on)
On the master I have:
# /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4 mesos-master
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
x.y.129.155 mesos-slave-1
x.y.129.161 mesos-slave-2
x.y.129.160 mesos-slave-3
x.y.129.162 mesos-slave-4
# conf/spark-env.sh
#!/usr/bin/env bash
SPARK_MASTER_HOST=x.y.129.163
SPARK_LOCAL_IP=x.y.129.163
And for the worker I have:
# /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4 mesos-slave-4 mesos-slave-4.novalocal
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
x.y.129.163 mesos-master
# conf/spark-env.sh
#!/usr/bin/env bash
SPARK_MASTER_HOST=x.y.129.163
SPARK_LOCAL_IP=x.y.129.162
I also disabled IPv6 entirely in /etc/sysctl.conf.
All daemons are started with the sbin/start-master.sh and sbin/start-slave.sh spark://x.y.129.163:7077 commands.
Update: I ran spark-submit again without --deploy-mode cluster, and it works! Any idea why it fails in cluster mode?
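The `java.net.BindException: Cannot assign requested address` in the driver log above generally means that a process asked to bind to an IP address that is not configured on the machine it is running on. What causes that in this particular cluster is not established here. The following is a minimal Python sketch for checking, on a given node, whether an address is bindable there. The function name `can_bind` is made up for this illustration and is not part of Spark.

```python
import socket
import sys


def can_bind(ip: str) -> bool:
    """Return True if this machine can bind a TCP socket to `ip` on an ephemeral port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            # Port 0 asks the OS for any free port, like the 'port 0' in the Spark log.
            sock.bind((ip, 0))
        except OSError:
            # Covers "Cannot assign requested address" and malformed addresses.
            return False
        return True


if __name__ == "__main__":
    # Pass the addresses to test on the command line; defaults to loopback.
    for ip in sys.argv[1:] or ["127.0.0.1"]:
        print(ip, "bindable" if can_bind(ip) else "NOT bindable on this host")
```

Running it on the worker with the address the driver is trying to use (for example the value of SPARK_LOCAL_IP that the driver process actually sees) would show whether the bind failure can be reproduced outside Spark.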

Apache Spark job reading from Cassandra table stalls on launch (spark-1.3.1)

We've been having intermittent issues with Spark 1.3.1 and the DataStax Cassandra connector: jobs stall indefinitely when they are launched.
EDIT: I also tried the same approach with Spark 1.2.1 and the packaged 1.2.1 spark-cassandra-connector_2.10 and it resulted in the same symptoms.
We are using the following dependency:
var sparkCas = "com.datastax.spark" % "spark-cassandra-connector_2.10" % "1.3.0-SNAPSHOT"
Our job code:
import com.datastax.spark.connector._ // adds cassandraTable to SparkContext
import org.apache.spark.{SparkConf, SparkContext}
import org.joda.time.DateTime // assumption: DateTime here is Joda-Time

object ConnTransform {
  private val AppName = "ConnTransformCassandra"

  def main(args: Array[String]): Unit = {
    val start = new DateTime(2015, 5, 27, 1, 0, 0)
    val master = if (args.length >= 1) args(0) else "local[*]"

    // Create the spark context.
    val sc = {
      val conf = new SparkConf()
        .setAppName(AppName)
        .setMaster(master)
        .set("spark.cassandra.connection.host", "10.10.101.202,10.10.102.139,10.10.103.74")
      new SparkContext(conf)
    }

    // Utils is our own helper object; its definition is not shown here.
    sc.cassandraTable("alpha_dev", "conn")
      .select("data")
      .where("timep = ?", start)
      .where("sensorid IN ?", Utils.sensors)
      .map(Utils.deserializeRow)
      .saveAsTextFile("output/raw_data")
  }
}
As you can see, the code is pretty simple (it was more complex, but we have been stripping it down to isolate the root cause of this issue).
Now, this job worked earlier today: data was successfully written to the specified directory. However, when it is run now, the job starts, gets to the point just before it begins processing blocks, and sits there indefinitely.
The output below shows the log messages seen so far; at the time of writing the job has been stalled for almost an hour. With the logging level set to DEBUG, the only thing that appears after that point is heartbeat pings between Akka workers.
ubuntu@ip-10-10-102-53:~/projects/icespark$ /home/ubuntu/spark/spark-1.3.1/bin/spark-submit --class com.splee.spark.ConnTransform splee-analytics-assembly-0.1.0.jar
15/05/27 21:15:21 INFO SparkContext: Running Spark version 1.3.1
15/05/27 21:15:21 INFO SecurityManager: Changing view acls to: ubuntu
15/05/27 21:15:21 INFO SecurityManager: Changing modify acls to: ubuntu
15/05/27 21:15:21 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ubuntu); users with modify permissions: Set(ubuntu)
15/05/27 21:15:22 INFO Slf4jLogger: Slf4jLogger started
15/05/27 21:15:22 INFO Remoting: Starting remoting
15/05/27 21:15:22 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@ip-10-10-102-53.us-west-2.compute.internal:51977]
15/05/27 21:15:22 INFO Utils: Successfully started service 'sparkDriver' on port 51977.
15/05/27 21:15:22 INFO SparkEnv: Registering MapOutputTracker
15/05/27 21:15:22 INFO SparkEnv: Registering BlockManagerMaster
15/05/27 21:15:22 INFO DiskBlockManager: Created local directory at /tmp/spark-2466ff66-bb50-4d52-9d34-1801d69889b9/blockmgr-60e75214-1ba6-410c-a564-361263636e5c
15/05/27 21:15:22 INFO MemoryStore: MemoryStore started with capacity 265.1 MB
15/05/27 21:15:22 INFO HttpFileServer: HTTP File server directory is /tmp/spark-72f1e849-c298-49ee-936c-e94c462f3df2/httpd-f81c2326-e5f1-4f33-9557-074f2789c4ee
15/05/27 21:15:22 INFO HttpServer: Starting HTTP Server
15/05/27 21:15:22 INFO Server: jetty-8.y.z-SNAPSHOT
15/05/27 21:15:22 INFO AbstractConnector: Started SocketConnector@0.0.0.0:55357
15/05/27 21:15:22 INFO Utils: Successfully started service 'HTTP file server' on port 55357.
15/05/27 21:15:22 INFO SparkEnv: Registering OutputCommitCoordinator
15/05/27 21:15:22 INFO Server: jetty-8.y.z-SNAPSHOT
15/05/27 21:15:22 INFO AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
15/05/27 21:15:22 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/05/27 21:15:22 INFO SparkUI: Started SparkUI at http://ip-10-10-102-53.us-west-2.compute.internal:4040
15/05/27 21:15:22 INFO SparkContext: Added JAR file:/home/ubuntu/projects/icespark/splee-analytics-assembly-0.1.0.jar at http://10.10.102.53:55357/jars/splee-analytics-assembly-0.1.0.jar with timestamp 1432761322942
15/05/27 21:15:23 INFO Executor: Starting executor ID <driver> on host localhost
15/05/27 21:15:23 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@ip-10-10-102-53.us-west-2.compute.internal:51977/user/HeartbeatReceiver
15/05/27 21:15:23 INFO NettyBlockTransferService: Server created on 58479
15/05/27 21:15:23 INFO BlockManagerMaster: Trying to register BlockManager
15/05/27 21:15:23 INFO BlockManagerMasterActor: Registering block manager localhost:58479 with 265.1 MB RAM, BlockManagerId(<driver>, localhost, 58479)
15/05/27 21:15:23 INFO BlockManagerMaster: Registered BlockManager
15/05/27 21:15:24 INFO Cluster: New Cassandra host /10.10.101.28:9042 added
15/05/27 21:15:24 INFO LocalNodeFirstLoadBalancingPolicy: Added host 10.10.101.28 (us-west-2)
15/05/27 21:15:24 INFO Cluster: New Cassandra host /10.10.103.60:9042 added
15/05/27 21:15:24 INFO LocalNodeFirstLoadBalancingPolicy: Added host 10.10.103.60 (us-west-2)
15/05/27 21:15:24 INFO Cluster: New Cassandra host /10.10.102.154:9042 added
15/05/27 21:15:24 INFO LocalNodeFirstLoadBalancingPolicy: Added host 10.10.102.154 (us-west-2)
15/05/27 21:15:24 INFO Cluster: New Cassandra host /10.10.101.145:9042 added
15/05/27 21:15:24 INFO LocalNodeFirstLoadBalancingPolicy: Added host 10.10.101.145 (us-west-2)
15/05/27 21:15:24 INFO Cluster: New Cassandra host /10.10.103.78:9042 added
15/05/27 21:15:24 INFO LocalNodeFirstLoadBalancingPolicy: Added host 10.10.103.78 (us-west-2)
15/05/27 21:15:24 INFO Cluster: New Cassandra host /10.10.102.200:9042 added
15/05/27 21:15:24 INFO LocalNodeFirstLoadBalancingPolicy: Added host 10.10.102.200 (us-west-2)
15/05/27 21:15:24 INFO Cluster: New Cassandra host /10.10.102.73:9042 added
15/05/27 21:15:24 INFO LocalNodeFirstLoadBalancingPolicy: Added host 10.10.102.73 (us-west-2)
15/05/27 21:15:24 INFO Cluster: New Cassandra host /10.10.103.205:9042 added
15/05/27 21:15:24 INFO LocalNodeFirstLoadBalancingPolicy: Added host 10.10.103.205 (us-west-2)
15/05/27 21:15:24 INFO Cluster: New Cassandra host /10.10.101.205:9042 added
15/05/27 21:15:24 INFO LocalNodeFirstLoadBalancingPolicy: Added host 10.10.101.205 (us-west-2)
15/05/27 21:15:24 INFO Cluster: New Cassandra host /10.10.103.74:9042 added
15/05/27 21:15:24 INFO Cluster: New Cassandra host /10.10.101.202:9042 added
15/05/27 21:15:24 INFO Cluster: New Cassandra host /10.10.102.139:9042 added
15/05/27 21:15:24 INFO CassandraConnector: Connected to Cassandra cluster: Splee Dev
15/05/27 21:15:25 INFO CassandraConnector: Disconnected from Cassandra cluster: Splee Dev
If anyone has any ideas what could be causing this job (which previously produced results) to stall in this way and can shed some light on the situation it would be much appreciated.

Spark fails when running the pi.py example in yarn-client mode

I can successfully run the Java version of the Pi example as follows.
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master yarn-client \
--num-executors 3 \
--driver-memory 4g \
--executor-memory 2g \
--executor-cores 1 \
--queue thequeue \
lib/spark-examples*.jar \
10
However, the Python version fails with the error output below. I used yarn-client mode, and the pyspark command line in yarn-client mode produces the same output. Can anyone help me figure out this problem?
nlp@yyy2:~/spark$ ./bin/spark-submit --master yarn-client examples/src/main/python/pi.py
15/01/05 17:22:26 INFO spark.SecurityManager: Changing view acls to: nlp
15/01/05 17:22:26 INFO spark.SecurityManager: Changing modify acls to: nlp
15/01/05 17:22:26 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(nlp); users with modify permissions: Set(nlp)
15/01/05 17:22:26 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/01/05 17:22:26 INFO Remoting: Starting remoting
15/01/05 17:22:26 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@yyy2:42747]
15/01/05 17:22:26 INFO util.Utils: Successfully started service 'sparkDriver' on port 42747.
15/01/05 17:22:26 INFO spark.SparkEnv: Registering MapOutputTracker
15/01/05 17:22:26 INFO spark.SparkEnv: Registering BlockManagerMaster
15/01/05 17:22:26 INFO storage.DiskBlockManager: Created local directory at /tmp/spark-local-20150105172226-aeae
15/01/05 17:22:26 INFO storage.MemoryStore: MemoryStore started with capacity 265.1 MB
15/01/05 17:22:27 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/01/05 17:22:27 INFO spark.HttpFileServer: HTTP File server directory is /tmp/spark-cbe0079b-79c5-426b-b67e-548805423b11
15/01/05 17:22:27 INFO spark.HttpServer: Starting HTTP Server
15/01/05 17:22:27 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/01/05 17:22:27 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:57169
15/01/05 17:22:27 INFO util.Utils: Successfully started service 'HTTP file server' on port 57169.
15/01/05 17:22:27 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/01/05 17:22:27 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
15/01/05 17:22:27 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
15/01/05 17:22:27 INFO ui.SparkUI: Started SparkUI at http://yyy2:4040
15/01/05 17:22:27 INFO client.RMProxy: Connecting to ResourceManager at yyy14/10.112.168.195:8032
15/01/05 17:22:27 INFO yarn.Client: Requesting a new application from cluster with 6 NodeManagers
15/01/05 17:22:27 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
15/01/05 17:22:27 INFO yarn.Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
15/01/05 17:22:27 INFO yarn.Client: Setting up container launch context for our AM
15/01/05 17:22:27 INFO yarn.Client: Preparing resources for our AM container
15/01/05 17:22:28 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 24 for xxx on ha-hdfs:hzdm-cluster1
15/01/05 17:22:28 INFO yarn.Client: Uploading resource file:/home/nlp/platform/spark-1.2.0-bin-2.5.2/lib/spark-assembly-1.2.0-hadoop2.5.2.jar -> hdfs://hzdm-cluster1/user/nlp/.sparkStaging/application_1420444011562_0023/spark-assembly-1.2.0-hadoop2.5.2.jar
15/01/05 17:22:29 INFO yarn.Client: Uploading resource file:/home/nlp/platform/spark-1.2.0-bin-2.5.2/examples/src/main/python/pi.py -> hdfs://hzdm-cluster1/user/nlp/.sparkStaging/application_1420444011562_0023/pi.py
15/01/05 17:22:29 INFO yarn.Client: Setting up the launch environment for our AM container
15/01/05 17:22:29 INFO spark.SecurityManager: Changing view acls to: nlp
15/01/05 17:22:29 INFO spark.SecurityManager: Changing modify acls to: nlp
15/01/05 17:22:29 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(nlp); users with modify permissions: Set(nlp)
15/01/05 17:22:29 INFO yarn.Client: Submitting application 23 to ResourceManager
15/01/05 17:22:30 INFO impl.YarnClientImpl: Submitted application application_1420444011562_0023
15/01/05 17:22:31 INFO yarn.Client: Application report for application_1420444011562_0023 (state: ACCEPTED)
15/01/05 17:22:31 INFO yarn.Client:
client token: Token { kind: YARN_CLIENT_TOKEN, service: }
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.default
start time: 1420449749969
final status: UNDEFINED
tracking URL: http://yyy14:8070/proxy/application_1420444011562_0023/
user: nlp
15/01/05 17:22:32 INFO yarn.Client: Application report for application_1420444011562_0023 (state: ACCEPTED)
15/01/05 17:22:33 INFO yarn.Client: Application report for application_1420444011562_0023 (state: ACCEPTED)
15/01/05 17:22:34 INFO yarn.Client: Application report for application_1420444011562_0023 (state: ACCEPTED)
15/01/05 17:22:35 INFO yarn.Client: Application report for application_1420444011562_0023 (state: ACCEPTED)
15/01/05 17:22:36 INFO yarn.Client: Application report for application_1420444011562_0023 (state: ACCEPTED)
15/01/05 17:22:36 INFO cluster.YarnClientSchedulerBackend: ApplicationMaster registered as Actor[akka.tcp://sparkYarnAM@yyy16:52855/user/YarnAM#435880073]
15/01/05 17:22:36 INFO cluster.YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> yyy14, PROXY_URI_BASES -> http://yyy14:8070/proxy/application_1420444011562_0023), /proxy/application_1420444011562_0023
15/01/05 17:22:36 INFO ui.JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
15/01/05 17:22:37 INFO yarn.Client: Application report for application_1420444011562_0023 (state: RUNNING)
15/01/05 17:22:37 INFO yarn.Client:
client token: Token { kind: YARN_CLIENT_TOKEN, service: }
diagnostics: N/A
ApplicationMaster host: yyy16
ApplicationMaster RPC port: 0
queue: root.default
start time: 1420449749969
final status: UNDEFINED
tracking URL: http://yyy14:8070/proxy/application_1420444011562_0023/
user: nlp
15/01/05 17:22:37 INFO cluster.YarnClientSchedulerBackend: Application application_1420444011562_0023 has started running.
15/01/05 17:22:37 INFO netty.NettyBlockTransferService: Server created on 35648
15/01/05 17:22:37 INFO storage.BlockManagerMaster: Trying to register BlockManager
15/01/05 17:22:37 INFO storage.BlockManagerMasterActor: Registering block manager yyy2:35648 with 265.1 MB RAM, BlockManagerId(<driver>, yyy2, 35648)
15/01/05 17:22:37 INFO storage.BlockManagerMaster: Registered BlockManager
15/01/05 17:22:37 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkYarnAM@yyy16:52855] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
15/01/05 17:22:38 ERROR cluster.YarnClientSchedulerBackend: Yarn application has already exited with state FINISHED!
15/01/05 17:22:38 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/stage/kill,null}
15/01/05 17:22:38 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/,null}
15/01/05 17:22:38 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/static,null}
15/01/05 17:22:38 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/executors/threadDump/json,null}
15/01/05 17:22:38 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/executors/threadDump,null}
15/01/05 17:22:38 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/executors/json,null}
15/01/05 17:22:38 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/executors,null}
15/01/05 17:22:38 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/environment/json,null}
15/01/05 17:22:38 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/environment,null}
15/01/05 17:22:38 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/storage/rdd/json,null}
15/01/05 17:22:38 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/storage/rdd,null}
15/01/05 17:22:38 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/storage/json,null}
15/01/05 17:22:38 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/storage,null}
15/01/05 17:22:38 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/pool/json,null}
15/01/05 17:22:38 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/pool,null}
15/01/05 17:22:38 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/stage/json,null}
15/01/05 17:22:38 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/stage,null}
15/01/05 17:22:38 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/json,null}
15/01/05 17:22:38 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages,null}
15/01/05 17:22:38 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/jobs/job/json,null}
15/01/05 17:22:38 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/jobs/job,null}
15/01/05 17:22:38 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/jobs/json,null}
15/01/05 17:22:38 INFO handler.ContextHandler: stopped o.e.j.s.ServletContextHandler{/jobs,null}
15/01/05 17:22:38 INFO ui.SparkUI: Stopped Spark web UI at http://yyy2:4040
15/01/05 17:22:38 INFO scheduler.DAGScheduler: Stopping DAGScheduler
15/01/05 17:22:38 INFO cluster.YarnClientSchedulerBackend: Shutting down all executors
15/01/05 17:22:38 INFO cluster.YarnClientSchedulerBackend: Asking each executor to shut down
15/01/05 17:22:38 INFO cluster.YarnClientSchedulerBackend: Stopped
15/01/05 17:22:39 INFO spark.MapOutputTrackerMasterActor: MapOutputTrackerActor stopped!
15/01/05 17:22:39 INFO storage.MemoryStore: MemoryStore cleared
15/01/05 17:22:39 INFO storage.BlockManager: BlockManager stopped
15/01/05 17:22:39 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
15/01/05 17:22:39 INFO spark.SparkContext: Successfully stopped SparkContext
15/01/05 17:22:39 INFO remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
15/01/05 17:22:39 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
15/01/05 17:22:39 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
15/01/05 17:22:57 INFO cluster.YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after waiting maxRegisteredResourcesWaitingTime: 30000(ms)
Traceback (most recent call last):
File "/home/nlp/platform/spark-1.2.0-bin-2.5.2/examples/src/main/python/pi.py", line 29, in <module>
sc = SparkContext(appName="PythonPi")
File "/home/nlp/spark/python/pyspark/context.py", line 105, in __init__
conf, jsc)
File "/home/nlp/spark/python/pyspark/context.py", line 153, in _do_init
self._jsc = jsc or self._initialize_context(self._conf._jconf)
File "/home/nlp/spark/python/pyspark/context.py", line 201, in _initialize_context
return self._jvm.JavaSparkContext(jconf)
File "/home/nlp/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 701, in __call__
File "/home/nlp/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.NullPointerException
at org.apache.spark.SparkContext.<init>(SparkContext.scala:497)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:214)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
If you're running this example on Java 8, this may be due to Java 8's excessive memory allocation strategy: https://issues.apache.org/jira/browse/YARN-4714
You can force YARN to ignore this by setting the following properties in yarn-site.xml:
<property>
  <name>yarn.nodemanager.pmem-check-enabled</name>
  <value>false</value>
</property>
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>
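To confirm that a NodeManager's yarn-site.xml really carries these two settings, something like the following Python sketch could be used. It assumes the usual Hadoop layout of `<property>` elements inside a `<configuration>` root; the helper name `read_properties` and the inline `SAMPLE` text are made up for this illustration, and in practice you would read the file from your own Hadoop configuration directory.

```python
import xml.etree.ElementTree as ET

# The two memory checks mentioned in the answer above.
CHECKS = (
    "yarn.nodemanager.pmem-check-enabled",
    "yarn.nodemanager.vmem-check-enabled",
)


def read_properties(xml_text: str) -> dict:
    """Parse Hadoop-style <property><name>/<value> XML into a name -> value dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


# Hypothetical file content, used here instead of reading a real yarn-site.xml.
SAMPLE = """<configuration>
  <property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    props = read_properties(SAMPLE)
    for key in CHECKS:
        # A key missing from the file means YARN uses its built-in default.
        print(key, "=", props.get(key, "<not set>"))
```

Note that the NodeManagers need to be restarted before a change to yarn-site.xml takes effect.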
Try adding the deploy mode parameter, like this:
--deploy-mode cluster
I had a problem like yours, and it worked with this parameter.
I experienced a similar problem using spark-submit and yarn-client (I got the same NPE and stack trace). Reducing my memory settings did the trick. It seems to fail like this when you try to allocate too much memory. I would start by removing the --executor-memory and --driver-memory switches.
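As a rough way to reason about "too much memory": the log in the question reports an AM container of 896 MB "including 384 MB overhead" and a cluster maximum of 8192 MB per container. The Python sketch below estimates a container request as heap plus overhead and compares it with that maximum. The overhead rule used here (7% of the heap, with a 384 MB floor) is an assumption based on Spark 1.2-era defaults and differs in other versions, and the function names are made up for this illustration. It also ignores YARN rounding requests up to its minimum allocation size.

```python
def container_request_mb(heap_mb: int,
                         overhead_factor: float = 0.07,
                         min_overhead_mb: int = 384) -> int:
    """Estimate the YARN container size for a JVM with the given heap (assumed rule)."""
    overhead = max(min_overhead_mb, int(heap_mb * overhead_factor))
    return heap_mb + overhead


def fits(heap_mb: int, max_container_mb: int) -> bool:
    """True if the estimated request does not exceed the per-container maximum."""
    return container_request_mb(heap_mb) <= max_container_mb


if __name__ == "__main__":
    max_mb = 8192  # "maximum memory capability of the cluster" from the log
    for heap in (512, 2048, 4096, 8192):
        request = container_request_mb(heap)
        verdict = "ok" if fits(heap, max_mb) else "too large"
        print(f"heap {heap} MB -> request {request} MB ({verdict})")
```

With these assumptions a 512 MB heap gives the 896 MB seen in the log, while an 8 GB heap would exceed the 8192 MB limit once overhead is added.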
I reduced the number of cores in the Advanced spark-env to make it work.
I ran into this issue running the following on HDP 2.3 with Spark 1.3.1:
spark-shell \
--master yarn-client \
--driver-memory 4g \
--executor-memory 4g \
--executor-cores 1 \
--num-executors 4
The solution for me was to set this Spark config value:
spark.yarn.am.extraJavaOptions=-Dhdp.version=2.3.0.0-2557

Resources