I have an AWS EMR cluster running Spark 2.4.4. I run a monthly data conversion process with PySpark and have never had an issue with it, but today I'm hitting an error I've never seen before:
INFO ApplicationMaster: Final app status: FAILED, exitCode: 11, (reason: Due to executor failures all available nodes are blacklisted)
Has anyone seen this error, and can you point to why it might be happening?
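For context, here is a hedged sketch of the standard Spark 2.4 blacklisting settings as I understand them; the values and the script name are illustrative only, and whether these settings govern the YarnAllocatorBlacklistTracker behaviour in the stderr below is exactly what I'm unsure about:
# Hedged sketch: standard Spark 2.4 blacklist settings with illustrative values
spark-submit \
  --conf spark.blacklist.enabled=false \
  --conf spark.blacklist.timeout=1h \
  --conf spark.blacklist.application.maxFailedExecutorsPerNode=2 \
  monthly_conversion_job.py   # placeholder name for the PySpark job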
Here's a larger snippet of my stderr file:
20/10/13 00:33:12 INFO YarnAllocator: Canceling requests for 1 executor container(s) to have a new desired total 502 executors.
20/10/13 00:33:15 INFO YarnAllocator: Yarn node state updated for host ip-XXX-31-45-56.ec2.internal to RUNNING
20/10/13 00:33:15 INFO YarnAllocator: Detected updated state Some(Running) for host ip-XXX-31-45-56.ec2.internal
20/10/13 00:33:52 INFO YarnAllocator: Driver requested a total number of 501 executor(s).
20/10/13 00:33:52 INFO YarnAllocator: Canceling requests for 1 executor container(s) to have a new desired total 501 executors.
20/10/13 00:34:09 INFO YarnAllocator: Driver requested a total number of 500 executor(s).
20/10/13 00:34:09 INFO YarnAllocator: Canceling requests for 1 executor container(s) to have a new desired total 500 executors.
20/10/13 00:34:39 INFO YarnAllocator: Driver requested a total number of 499 executor(s).
20/10/13 00:34:39 INFO YarnAllocator: Canceling requests for 1 executor container(s) to have a new desired total 499 executors.
20/10/13 00:34:55 INFO YarnAllocator: Driver requested a total number of 498 executor(s).
20/10/13 00:34:55 INFO YarnAllocator: Canceling requests for 1 executor container(s) to have a new desired total 498 executors.
20/10/13 00:35:15 INFO YarnAllocator: Driver requested a total number of 497 executor(s).
20/10/13 00:35:15 INFO YarnAllocator: Canceling requests for 1 executor container(s) to have a new desired total 497 executors.
20/10/13 00:35:34 INFO YarnAllocator: Driver requested a total number of 496 executor(s).
20/10/13 00:35:34 INFO YarnAllocator: Canceling requests for 1 executor container(s) to have a new desired total 496 executors.
20/10/13 00:35:58 INFO YarnAllocator: Driver requested a total number of 495 executor(s).
20/10/13 00:35:58 INFO YarnAllocator: Canceling requests for 1 executor container(s) to have a new desired total 495 executors.
20/10/13 00:36:07 INFO YarnAllocator: Driver requested a total number of 494 executor(s).
20/10/13 00:36:07 INFO YarnAllocator: Canceling requests for 1 executor container(s) to have a new desired total 494 executors.
20/10/13 00:36:20 INFO YarnAllocator: Driver requested a total number of 493 executor(s).
20/10/13 00:36:20 INFO YarnAllocator: Canceling requests for 1 executor container(s) to have a new desired total 493 executors.
20/10/13 00:36:24 INFO YarnAllocator: Driver requested a total number of 492 executor(s).
20/10/13 00:36:24 INFO YarnAllocator: Canceling requests for 1 executor container(s) to have a new desired total 492 executors.
20/10/13 00:36:31 INFO YarnAllocator: Driver requested a total number of 491 executor(s).
20/10/13 00:36:31 INFO YarnAllocator: Canceling requests for 1 executor container(s) to have a new desired total 491 executors.
20/10/13 00:36:37 INFO YarnAllocator: Driver requested a total number of 490 executor(s).
20/10/13 00:36:37 INFO YarnAllocator: Canceling requests for 1 executor container(s) to have a new desired total 490 executors.
20/10/13 00:36:38 INFO YarnAllocator: Yarn node state updated for host ip-XXX-31-45-21.ec2.internal to DECOMMISSIONING
20/10/13 00:36:38 INFO YarnAllocator: Detected updated state Some(Decommissioning(20)) for host ip-XXX-31-45-21.ec2.internal
20/10/13 00:36:38 INFO YarnAllocator: Yarn node state updated for host ip-XXX-31-39-56.ec2.internal to DECOMMISSIONED
20/10/13 00:36:38 INFO YarnAllocator: Detected updated state Some(Decommissioned) for host ip-XXX-31-39-56.ec2.internal
20/10/13 00:36:38 INFO YarnAllocator: Yarn node state updated for host ip-XXX-31-47-237.ec2.internal to DECOMMISSIONING
20/10/13 00:36:38 INFO YarnAllocator: Detected updated state Some(Decommissioning(20)) for host ip-XXX-31-47-237.ec2.internal
20/10/13 00:36:38 INFO YarnAllocator: Yarn node state updated for host ip-XXX-XX-XX-XXX.ec2.internal to DECOMMISSIONING
20/10/13 00:36:38 INFO YarnAllocator: Detected updated state Some(Decommissioning(20)) for host ip-XXX-XX-XX-XXX.ec2.internal
20/10/13 00:36:38 INFO YarnAllocator: Yarn node state updated for host ip-XXX-31-40-98.ec2.internal to DECOMMISSIONING
20/10/13 00:36:38 INFO YarnAllocator: Detected updated state Some(Decommissioning(20)) for host ip-XXX-31-40-98.ec2.internal
20/10/13 00:36:38 INFO YarnAllocator: Yarn node state updated for host ip-XXX-31-35-249.ec2.internal to DECOMMISSIONING
20/10/13 00:36:38 INFO YarnAllocator: Detected updated state Some(Decommissioning(20)) for host ip-XXX-31-35-249.ec2.internal
20/10/13 00:36:38 INFO YarnAllocator: Yarn node state updated for host ip-XXX-31-36-204.ec2.internal to DECOMMISSIONING
20/10/13 00:36:38 INFO YarnAllocator: Detected updated state Some(Decommissioning(20)) for host ip-XXX-31-36-204.ec2.internal
20/10/13 00:36:38 INFO YarnAllocator: Yarn node state updated for host ip-XXX-31-45-56.ec2.internal to DECOMMISSIONING
20/10/13 00:36:38 INFO YarnAllocator: Detected updated state Some(Decommissioning(20)) for host ip-XXX-31-45-56.ec2.internal
20/10/13 00:36:38 INFO YarnAllocator: Yarn node state updated for host ip-XXX-31-32-155.ec2.internal to DECOMMISSIONING
20/10/13 00:36:38 INFO YarnAllocator: Detected updated state Some(Decommissioning(20)) for host ip-XXX-31-32-155.ec2.internal
20/10/13 00:36:40 INFO YarnAllocator: Yarn node state updated for host ip-XXX-31-45-21.ec2.internal to DECOMMISSIONED
20/10/13 00:36:40 INFO YarnAllocator: Detected updated state Some(Decommissioned) for host ip-XXX-31-45-21.ec2.internal
20/10/13 00:36:40 INFO YarnAllocator: Yarn node state updated for host ip-XXX-31-47-237.ec2.internal to DECOMMISSIONED
20/10/13 00:36:40 INFO YarnAllocator: Detected updated state Some(Decommissioned) for host ip-XXX-31-47-237.ec2.internal
20/10/13 00:36:40 INFO YarnAllocator: Yarn node state updated for host ip-XXX-XX-XX-XXX.ec2.internal to DECOMMISSIONED
20/10/13 00:36:40 INFO YarnAllocator: Detected updated state Some(Decommissioned) for host ip-XXX-XX-XX-XXX.ec2.internal
20/10/13 00:36:40 INFO YarnAllocator: Yarn node state updated for host ip-XXX-31-40-98.ec2.internal to DECOMMISSIONED
20/10/13 00:36:40 INFO YarnAllocator: Detected updated state Some(Decommissioned) for host ip-XXX-31-40-98.ec2.internal
20/10/13 00:36:40 INFO YarnAllocator: Yarn node state updated for host ip-XXX-31-35-249.ec2.internal to DECOMMISSIONED
20/10/13 00:36:40 INFO YarnAllocator: Detected updated state Some(Decommissioned) for host ip-XXX-31-35-249.ec2.internal
20/10/13 00:36:40 INFO YarnAllocator: Yarn node state updated for host ip-XXX-31-36-204.ec2.internal to DECOMMISSIONED
20/10/13 00:36:40 INFO YarnAllocator: Detected updated state Some(Decommissioned) for host ip-XXX-31-36-204.ec2.internal
20/10/13 00:36:40 INFO YarnAllocator: Yarn node state updated for host ip-XXX-31-45-56.ec2.internal to DECOMMISSIONED
20/10/13 00:36:40 INFO YarnAllocator: Detected updated state Some(Decommissioned) for host ip-XXX-31-45-56.ec2.internal
20/10/13 00:36:40 INFO YarnAllocator: Yarn node state updated for host ip-XXX-31-32-155.ec2.internal to DECOMMISSIONED
20/10/13 00:36:40 INFO YarnAllocator: Detected updated state Some(Decommissioned) for host ip-XXX-31-32-155.ec2.internal
20/10/13 00:36:42 INFO YarnAllocator: Driver requested a total number of 489 executor(s).
20/10/13 00:36:42 INFO YarnAllocatorBlacklistTracker: adding nodes to YARN application master's blacklist: List(ip-XXX-31-32-155.ec2.internal, ip-XXX-31-35-249.ec2.internal, ip-XXX-31-36-204.ec2.internal, ip-XXX-31-40-98.ec2.internal, ip-XXX-XX-XX-XXX.ec2.internal, ip-XXX-31-45-21.ec2.internal, ip-XXX-31-45-56.ec2.internal, ip-XXX-31-47-237.ec2.internal)
20/10/13 00:36:42 INFO YarnAllocator: Canceling requests for 1 executor container(s) to have a new desired total 489 executors.
20/10/13 00:36:50 INFO YarnAllocator: Driver requested a total number of 488 executor(s).
20/10/13 00:36:50 INFO YarnAllocator: Canceling requests for 1 executor container(s) to have a new desired total 488 executors.
20/10/13 00:37:00 INFO YarnAllocator: Driver requested a total number of 487 executor(s).
20/10/13 00:37:00 INFO YarnAllocator: Canceling requests for 1 executor container(s) to have a new desired total 487 executors.
20/10/13 00:37:07 INFO YarnAllocator: Driver requested a total number of 486 executor(s).
20/10/13 00:37:07 INFO YarnAllocator: Canceling requests for 1 executor container(s) to have a new desired total 486 executors.
20/10/13 00:37:11 INFO YarnAllocator: Driver requested a total number of 485 executor(s).
20/10/13 00:37:11 INFO YarnAllocator: Canceling requests for 1 executor container(s) to have a new desired total 485 executors.
20/10/13 00:37:17 INFO YarnAllocator: Yarn node state updated for host ip-XXX-31-39-64.ec2.internal to DECOMMISSIONED
20/10/13 00:37:17 INFO YarnAllocator: Detected updated state Some(Decommissioned) for host ip-XXX-31-39-64.ec2.internal
20/10/13 00:37:17 INFO YarnAllocator: Yarn node state updated for host ip-XXX-31-35-183.ec2.internal to DECOMMISSIONED
20/10/13 00:37:17 INFO YarnAllocator: Detected updated state Some(Decommissioned) for host ip-XXX-31-35-183.ec2.internal
20/10/13 00:37:17 INFO YarnAllocator: Yarn node state updated for host ip-XXX-31-32-30.ec2.internal to DECOMMISSIONED
20/10/13 00:37:17 INFO YarnAllocator: Detected updated state Some(Decommissioned) for host ip-XXX-31-32-30.ec2.internal
20/10/13 00:37:17 INFO YarnAllocator: Yarn node state updated for host ip-XXX-31-42-148.ec2.internal to DECOMMISSIONED
20/10/13 00:37:17 INFO YarnAllocator: Detected updated state Some(Decommissioned) for host ip-XXX-31-42-148.ec2.internal
20/10/13 00:37:17 INFO YarnAllocator: Yarn node state updated for host ip-XXX-31-47-172.ec2.internal to DECOMMISSIONED
20/10/13 00:37:17 INFO YarnAllocator: Detected updated state Some(Decommissioned) for host ip-XXX-31-47-172.ec2.internal
20/10/13 00:37:17 INFO YarnAllocator: Yarn node state updated for host ip-XXX-31-45-171.ec2.internal to DECOMMISSIONED
20/10/13 00:37:17 INFO YarnAllocator: Detected updated state Some(Decommissioned) for host ip-XXX-31-45-171.ec2.internal
20/10/13 00:37:17 INFO YarnAllocator: Yarn node state updated for host ip-XXX-31-40-251.ec2.internal to DECOMMISSIONED
20/10/13 00:37:17 INFO YarnAllocator: Detected updated state Some(Decommissioned) for host ip-XXX-31-40-251.ec2.internal
20/10/13 00:37:17 INFO YarnAllocator: Yarn node state updated for host ip-XXX-31-47-53.ec2.internal to DECOMMISSIONED
20/10/13 00:37:17 INFO YarnAllocator: Detected updated state Some(Decommissioned) for host ip-XXX-31-47-53.ec2.internal
20/10/13 00:37:17 INFO YarnAllocator: Driver requested a total number of 484 executor(s).
20/10/13 00:37:17 INFO ApplicationMaster: Final app status: FAILED, exitCode: 11, (reason: Due to executor failures all available nodes are blacklisted)
20/10/13 00:37:17 INFO ShutdownHookManager: Shutdown hook called
We have a 3-node Mesos cluster. The master service was started on machine1 using the command below:
sudo ./bin/mesos-master.sh --ip=machine1-ip --work_dir=/home/mapr/mesos/mesos-1.7.0/build/workDir --zk=zk://machine1-ip:2181/mesos --quorum=1
and the agent services on the other 2 machines using the commands below:
sudo ./bin/mesos-agent.sh --containerizers=docker --master=zk://machine1-ip:2181/mesos --work_dir=/home/mapr/mesos/mesos-1.7.0/build/workDir --ip=machine2-ip --no-systemd_enable_support
sudo ./bin/mesos-agent.sh --containerizers=docker --master=zk://machine1-ip:2181/mesos --work_dir=/home/mapr/mesos/mesos-1.7.0/build/workDir --ip=machine3-ip --no-systemd_enable_support
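For reference, a minimal sketch of how agent registration can be verified from machine1; the /master/slaves endpoint reflects my understanding of the Mesos HTTP API, and the web UI on port 5050 shows the same information:
# run from machine1: lists the agents the master currently knows about
curl -s http://machine1-ip:5050/master/slaves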
The following property was set on machine1:
export MESOS_NATIVE_JAVA_LIBRARY=/usr/local/lib/libmesos.so
We are trying to run a Spark job using a Docker image.
Note that we did not set "SPARK_EXECUTOR_URI" on machine1 because, as per our understanding, the executor will run inside the Docker container and not on the slave machine, so this property should not be required.
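For reference, a hedged sketch of the executor-location settings that appear relevant here; /opt/spark/dist is only a guess at where the mesosphere/spark image keeps its Spark distribution, so treat both the path and the choice of setting as assumptions, not our working configuration:
# Hedged sketch: tell executors where Spark lives inside the image (path is a guess);
# spark.executor.uri would be the alternative of shipping a distribution instead.
/home/mapr/newSpark/spark-2.4.0-bin-hadoop2.7/bin/spark-submit \
  --master mesos://machine1:5050 \
  --class com.learning.spark.WordCount \
  --conf spark.mesos.executor.docker.image=mesosphere/spark:2.4.0-2.2.1-3-hadoop-2.7 \
  --conf spark.mesos.executor.home=/opt/spark/dist \
  /home/mapr/mesos/wordcount.jar hdfs://machine2:8020/hdfslocation/input.txt hdfs://machine2:8020/hdfslocation/output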
The command used for spark-submit is below (run from machine1):
/home/mapr/newSpark/spark-2.4.0-bin-hadoop2.7/bin/spark-submit \
--master mesos://machine1:5050 \
--deploy-mode client \
--class com.learning.spark.WordCount \
--conf spark.mesos.executor.docker.image=mesosphere/spark:2.4.0-2.2.1-3-hadoop-2.7 \
/home/mapr/mesos/wordcount.jar hdfs://machine2:8020/hdfslocation/input.txt hdfs://machine2:8020/hdfslocation/output
We are getting the below error on spark-submit:
Mesos task log:
I1211 20:27:55.040856 5996 exec.cpp:162] Version: 1.7.0
I1211 20:27:55.064775 6016 exec.cpp:236] Executor registered on agent 44c2e848-cd06-4546-b0e9-15537084df1b-S1
I1211 20:27:55.068828 6018 executor.cpp:130] Registered docker executor on company-i0058.company.co.in
I1211 20:27:55.069756 6016 executor.cpp:186] Starting task 3
/bin/sh: 1: /home/mapr/newSpark/spark-2.4.0-bin-hadoop2.7/./bin/spark-class: not found
I1211 20:27:57.669881 6017 executor.cpp:736] Container exited with status 127
I1211 20:27:58.672829 6019 process.cpp:926] Stopped the socket accept loop
messages on the terminal:
2018-12-11 20:27:49 INFO SparkContext:54 - Running Spark version 2.4.0
2018-12-11 20:27:49 INFO SparkContext:54 - Submitted application: WordCount
2018-12-11 20:27:49 INFO SecurityManager:54 - Changing view acls to: mapr
2018-12-11 20:27:49 INFO SecurityManager:54 - Changing modify acls to: mapr
2018-12-11 20:27:49 INFO SecurityManager:54 - Changing view acls groups to:
2018-12-11 20:27:49 INFO SecurityManager:54 - Changing modify acls groups to:
2018-12-11 20:27:49 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mapr); groups with view permissions: Set(); users with modify permissions: Set(mapr); groups with modify permissions: Set()
2018-12-11 20:27:49 INFO Utils:54 - Successfully started service 'sparkDriver' on port 48069.
2018-12-11 20:27:49 INFO SparkEnv:54 - Registering MapOutputTracker
2018-12-11 20:27:49 INFO SparkEnv:54 - Registering BlockManagerMaster
2018-12-11 20:27:49 INFO BlockManagerMasterEndpoint:54 - Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
2018-12-11 20:27:49 INFO BlockManagerMasterEndpoint:54 - BlockManagerMasterEndpoint up
2018-12-11 20:27:49 INFO DiskBlockManager:54 - Created local directory at /tmp/blockmgr-3a4afff7-b050-45ba-bb50-c9f4ec5cc031
2018-12-11 20:27:49 INFO MemoryStore:54 - MemoryStore started with capacity 366.3 MB
2018-12-11 20:27:49 INFO SparkEnv:54 - Registering OutputCommitCoordinator
2018-12-11 20:27:49 INFO log:192 - Logging initialized #3157ms
2018-12-11 20:27:50 INFO Server:351 - jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
2018-12-11 20:27:50 INFO Server:419 - Started #3273ms
2018-12-11 20:27:50 INFO AbstractConnector:278 - Started ServerConnector#1cfd1875{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2018-12-11 20:27:50 INFO Utils:54 - Successfully started service 'SparkUI' on port 4040.
2018-12-11 20:27:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#6f0628de{/jobs,null,AVAILABLE,#Spark}
2018-12-11 20:27:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#2b27cc70{/jobs/json,null,AVAILABLE,#Spark}
2018-12-11 20:27:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#6f6a7463{/jobs/job,null,AVAILABLE,#Spark}
2018-12-11 20:27:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#79f227a9{/jobs/job/json,null,AVAILABLE,#Spark}
2018-12-11 20:27:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#6ca320ab{/stages,null,AVAILABLE,#Spark}
2018-12-11 20:27:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#50d68830{/stages/json,null,AVAILABLE,#Spark}
2018-12-11 20:27:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#1e53135d{/stages/stage,null,AVAILABLE,#Spark}
2018-12-11 20:27:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#6754ef00{/stages/stage/json,null,AVAILABLE,#Spark}
2018-12-11 20:27:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#619bd14c{/stages/pool,null,AVAILABLE,#Spark}
2018-12-11 20:27:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#323e8306{/stages/pool/json,null,AVAILABLE,#Spark}
2018-12-11 20:27:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#a23a01d{/storage,null,AVAILABLE,#Spark}
2018-12-11 20:27:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#4acf72b6{/storage/json,null,AVAILABLE,#Spark}
2018-12-11 20:27:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#7561db12{/storage/rdd,null,AVAILABLE,#Spark}
2018-12-11 20:27:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#3301500b{/storage/rdd/json,null,AVAILABLE,#Spark}
2018-12-11 20:27:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#24b52d3e{/environment,null,AVAILABLE,#Spark}
2018-12-11 20:27:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#15deb1dc{/environment/json,null,AVAILABLE,#Spark}
2018-12-11 20:27:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#6e9c413e{/executors,null,AVAILABLE,#Spark}
2018-12-11 20:27:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#57a4d5ee{/executors/json,null,AVAILABLE,#Spark}
2018-12-11 20:27:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#5af5def9{/executors/threadDump,null,AVAILABLE,#Spark}
2018-12-11 20:27:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#3a45c42a{/executors/threadDump/json,null,AVAILABLE,#Spark}
2018-12-11 20:27:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#36dce7ed{/static,null,AVAILABLE,#Spark}
2018-12-11 20:27:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#4b770e40{/,null,AVAILABLE,#Spark}
2018-12-11 20:27:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#78e16155{/api,null,AVAILABLE,#Spark}
2018-12-11 20:27:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#19868320{/jobs/job/kill,null,AVAILABLE,#Spark}
2018-12-11 20:27:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#50b0bc4c{/stages/stage/kill,null,AVAILABLE,#Spark}
2018-12-11 20:27:50 INFO SparkUI:54 - Bound SparkUI to 0.0.0.0, and started at http://machine1:4040
2018-12-11 20:27:50 INFO SparkContext:54 - Added JAR file:/home/mapr/mesos/wordcount.jar at spark://machine1:48069/jars/wordcount.jar with timestamp 1544540270193
I1211 20:27:50.557170 7462 sched.cpp:232] Version: 1.7.0
I1211 20:27:50.560644 7454 sched.cpp:336] New master detected at master#machine1:5050
I1211 20:27:50.561132 7454 sched.cpp:356] No credentials provided. Attempting to register without authentication
I1211 20:27:50.571651 7456 sched.cpp:744] Framework registered with 5260e4c8-de1c-4772-b5a7-340480594ef4-0000
2018-12-11 20:27:50 INFO Utils:54 - Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 56351.
2018-12-11 20:27:50 INFO NettyBlockTransferService:54 - Server created on machine1:56351
2018-12-11 20:27:50 INFO BlockManager:54 - Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
2018-12-11 20:27:50 INFO BlockManagerMaster:54 - Registering BlockManager BlockManagerId(driver, impetus-i0053.impetus.co.in, 56351, None)
2018-12-11 20:27:50 INFO BlockManagerMasterEndpoint:54 - Registering block manager machine1:56351 with 366.3 MB RAM, BlockManagerId(driver, impetus-i0053.impetus.co.in, 56351, None)
2018-12-11 20:27:50 INFO BlockManagerMaster:54 - Registered BlockManager BlockManagerId(driver, machine1, 56351, None)
2018-12-11 20:27:50 INFO BlockManager:54 - Initialized BlockManager: BlockManagerId(driver, machine1, 56351, None)
2018-12-11 20:27:50 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler#73ba6fe6{/metrics/json,null,AVAILABLE,#Spark}
2018-12-11 20:27:50 INFO MesosCoarseGrainedSchedulerBackend:54 - SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
2018-12-11 20:27:51 INFO MesosCoarseGrainedSchedulerBackend:54 - Mesos task 0 is now TASK_STARTING
2018-12-11 20:27:51 INFO MesosCoarseGrainedSchedulerBackend:54 - Mesos task 1 is now TASK_STARTING
2018-12-11 20:27:51 INFO MemoryStore:54 - Block broadcast_0 stored as values in memory (estimated size 288.1 KB, free 366.0 MB)
2018-12-11 20:27:51 INFO MemoryStore:54 - Block broadcast_0_piece0 stored as bytes in memory (estimated size 25.1 KB, free 366.0 MB)
2018-12-11 20:27:51 INFO BlockManagerInfo:54 - Added broadcast_0_piece0 in memory on machine1:56351 (size: 25.1 KB, free: 366.3 MB)
2018-12-11 20:27:51 INFO SparkContext:54 - Created broadcast 0 from textFile at WordCount.scala:22
2018-12-11 20:27:52 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-12-11 20:27:52 INFO FileInputFormat:249 - Total input paths to process : 1
2018-12-11 20:27:53 INFO deprecation:1173 - mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
2018-12-11 20:27:53 INFO HadoopMapRedCommitProtocol:54 - Using output committer class org.apache.hadoop.mapred.FileOutputCommitter
2018-12-11 20:27:53 INFO FileOutputCommitter:108 - File Output Committer Algorithm version is 1
2018-12-11 20:27:53 INFO SparkContext:54 - Starting job: runJob at SparkHadoopWriter.scala:78
2018-12-11 20:27:53 INFO DAGScheduler:54 - Registering RDD 3 (map at WordCount.scala:24)
2018-12-11 20:27:53 INFO DAGScheduler:54 - Got job 0 (runJob at SparkHadoopWriter.scala:78) with 2 output partitions
2018-12-11 20:27:53 INFO DAGScheduler:54 - Final stage: ResultStage 1 (runJob at SparkHadoopWriter.scala:78)
2018-12-11 20:27:53 INFO DAGScheduler:54 - Parents of final stage: List(ShuffleMapStage 0)
2018-12-11 20:27:53 INFO DAGScheduler:54 - Missing parents: List(ShuffleMapStage 0)
2018-12-11 20:27:53 INFO DAGScheduler:54 - Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at map at WordCount.scala:24), which has no missing parents
2018-12-11 20:27:53 INFO MesosCoarseGrainedSchedulerBackend:54 - Mesos task 1 is now TASK_RUNNING
2018-12-11 20:27:53 INFO MemoryStore:54 - Block broadcast_1 stored as values in memory (estimated size 5.0 KB, free 366.0 MB)
2018-12-11 20:27:53 INFO MemoryStore:54 - Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.9 KB, free 366.0 MB)
2018-12-11 20:27:53 INFO BlockManagerInfo:54 - Added broadcast_1_piece0 in memory on machine1:56351 (size: 2.9 KB, free: 366.3 MB)
2018-12-11 20:27:53 INFO SparkContext:54 - Created broadcast 1 from broadcast at DAGScheduler.scala:1161
2018-12-11 20:27:53 INFO DAGScheduler:54 - Submitting 2 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[3] at map at WordCount.scala:24) (first 15 tasks are for partitions Vector(0, 1))
2018-12-11 20:27:53 INFO TaskSchedulerImpl:54 - Adding task set 0.0 with 2 tasks
2018-12-11 20:27:53 INFO MesosCoarseGrainedSchedulerBackend:54 - Mesos task 0 is now TASK_RUNNING
2018-12-11 20:27:54 INFO MesosCoarseGrainedSchedulerBackend:54 - Mesos task 0 is now TASK_FAILED
2018-12-11 20:27:54 INFO BlockManagerMaster:54 - Removal of executor 0 requested
2018-12-11 20:27:54 INFO CoarseGrainedSchedulerBackend$DriverEndpoint:54 - Asked to remove non-existent executor 0
2018-12-11 20:27:54 INFO BlockManagerMasterEndpoint:54 - Trying to remove executor 0 from BlockManagerMaster.
2018-12-11 20:27:54 INFO MesosCoarseGrainedSchedulerBackend:54 - Mesos task 1 is now TASK_FAILED
2018-12-11 20:27:54 INFO BlockManagerMasterEndpoint:54 - Trying to remove executor 1 from BlockManagerMaster.
2018-12-11 20:27:54 INFO BlockManagerMaster:54 - Removal of executor 1 requested
2018-12-11 20:27:54 INFO CoarseGrainedSchedulerBackend$DriverEndpoint:54 - Asked to remove non-existent executor 1
2018-12-11 20:27:54 INFO MesosCoarseGrainedSchedulerBackend:54 - Mesos task 2 is now TASK_STARTING
2018-12-11 20:27:55 INFO MesosCoarseGrainedSchedulerBackend:54 - Mesos task 3 is now TASK_STARTING
2018-12-11 20:27:57 INFO MesosCoarseGrainedSchedulerBackend:54 - Mesos task 2 is now TASK_RUNNING
2018-12-11 20:27:57 INFO MesosCoarseGrainedSchedulerBackend:54 - Mesos task 2 is now TASK_FAILED
2018-12-11 20:27:57 INFO MesosCoarseGrainedSchedulerBackend:54 - Blacklisting Mesos slave b92da3e9-a9c4-422a-babe-c5fb0f33e027-S0 due to too many failures; is Spark installed on it?
2018-12-11 20:27:57 INFO BlockManagerMaster:54 - Removal of executor 2 requested
2018-12-11 20:27:57 INFO CoarseGrainedSchedulerBackend$DriverEndpoint:54 - Asked to remove non-existent executor 2
2018-12-11 20:27:57 INFO BlockManagerMasterEndpoint:54 - Trying to remove executor 2 from BlockManagerMaster.
2018-12-11 20:27:57 INFO MesosCoarseGrainedSchedulerBackend:54 - Mesos task 3 is now TASK_RUNNING
2018-12-11 20:27:57 INFO MesosCoarseGrainedSchedulerBackend:54 - Mesos task 3 is now TASK_FAILED
2018-12-11 20:27:57 INFO MesosCoarseGrainedSchedulerBackend:54 - Blacklisting Mesos slave 44c2e848-cd06-4546-b0e9-15537084df1b-S1 due to too many failures; is Spark installed on it?
2018-12-11 20:27:57 INFO BlockManagerMaster:54 - Removal of executor 3 requested
2018-12-11 20:27:57 INFO BlockManagerMasterEndpoint:54 - Trying to remove executor 3 from BlockManagerMaster.
2018-12-11 20:27:57 INFO CoarseGrainedSchedulerBackend$DriverEndpoint:54 - Asked to remove non-existent executor 3
2018-12-11 20:28:08 WARN TaskSchedulerImpl:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
I have created a Spark cluster on Dataproc with 1 master and 6 worker nodes.
On the GCP console I can see that 6 VMs are running, but I only see 5 nodes in the YARN Node Manager UI.
When I SSH into the missing machine, the yarn-yarn-nodemanager log shows that the NodeManager keeps restarting and trying to re-register with the ResourceManager.
How can I make this node rejoin the cluster?
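For reference, a minimal sketch of how the missing node can be inspected and its NodeManager restarted; the systemd unit name is what I believe the Dataproc image uses and may differ, so treat it as an assumption:
# on the master: list every node YARN knows about, including unhealthy/lost ones
yarn node -list -all
# on the affected worker (unit name assumed): restart the NodeManager and follow its log
sudo systemctl restart hadoop-yarn-nodemanager
sudo journalctl -u hadoop-yarn-nodemanager -f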
Update: here is my cluster creation command:
gcloud dataproc clusters create ${GCS_CLUSTER} \
--image pyspark-with-conda \
--bucket test-spark-data \
--zone asia-east1-b \
--master-boot-disk-size 500GB \
--master-machine-type n1-standard-2 \
--num-masters 1 \
--num-workers 2 \
--worker-machine-type n1-standard-8 \
--num-preemptible-workers 4 \
--preemptible-worker-boot-disk-size 500GB
Error message:
2018-08-22 08:25:24,801 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at test-spark-cluster-m/10.140.0.34:8031
2018-08-22 08:25:24,836 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out 0 NM container statuses: []
2018-08-22 08:25:24,843 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registering with RM using containers :[]
2018-08-22 08:25:24,978 ERROR org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Unexpected error starting NodeStatusUpdater
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Recieved SHUTDOWN signal from Resourcemanager, Registration of NodeManager failed, Message from ResourceManager: Disallowed NodeManager from test-spark-cluster-sw-kbvq.c.test-155104.internal, Sending SHUTDOWN signal to the NodeManager.
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:374)
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:252)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:454)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:837)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:897)
2018-08-22 08:25:24,979 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl failed in state STARTED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Recieved SHUTDOWN signal from Resourcemanager, Registration of NodeManager failed, Message from ResourceManager: Disallowed NodeManager from test-spark-cluster-sw-kbvq.c.test-155104.internal, Sending SHUTDOWN signal to the NodeManager.
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Recieved SHUTDOWN signal from Resourcemanager, Registration of NodeManager failed, Message from ResourceManager: Disallowed NodeManager from test-spark-cluster-sw-kbvq.c.test-155104.internal, Sending SHUTDOWN signal to the NodeManager.
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:258)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:454)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:837)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:897)
Caused by: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Recieved SHUTDOWN signal from Resourcemanager, Registration of NodeManager failed, Message from ResourceManager: Disallowed NodeManager from test-spark-cluster-sw-kbvq.c.pvmax-155104.internal, Sending SHUTDOWN signal to the NodeManager.
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:374)
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:252)
... 6 more
2018-08-22 08:25:25,081 INFO org.apache.hadoop.service.AbstractService: Service NodeManager failed in state STARTED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Recieved SHUTDOWN signal from Resourcemanager, Registration of NodeManager failed, Message from ResourceManager: Disallowed NodeManager from test-spark-cluster-sw-kbvq.c.test-155104.internal, Sending SHUTDOWN signal to the NodeManager.
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Recieved SHUTDOWN signal from Resourcemanager, Registration of NodeManager failed, Message from ResourceManager: Disallowed NodeManager from test-spark-cluster-sw-kbvq.c.test-155104.internal, Sending SHUTDOWN signal to the NodeManager.
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:258)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:454)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:837)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:897)
Caused by: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Recieved SHUTDOWN signal from Resourcemanager, Registration of NodeManager failed, Message from ResourceManager: Disallowed NodeManager from test-spark-cluster-sw-kbvq.c.pvmax-155104.internal, Sending SHUTDOWN signal to the NodeManager.
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:374)
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:252)
... 6 more
2018-08-22 08:25:25,084 INFO org.mortbay.log: Stopped HttpServer2$SelectChannelConnectorWithSafeStartup#0.0.0.0:8042
2018-08-22 08:25:25,185 INFO org.apache.hadoop.ipc.Server: Stopping server on 60144
2018-08-22 08:25:25,186 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 60144
2018-08-22 08:25:25,186 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
2018-08-22 08:25:25,187 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl is interrupted. Exiting.
2018-08-22 08:25:25,204 INFO org.apache.hadoop.ipc.Server: Stopping server on 8040
2018-08-22 08:25:25,204 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 8040
2018-08-22 08:25:25,205 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Public cache exiting
2018-08-22 08:25:25,205 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
2018-08-22 08:25:25,205 WARN org.apache.hadoop.yarn.server.nodemanager.NodeResourceMonitorImpl: org.apache.hadoop.yarn.server.nodemanager.NodeResourceMonitorImpl is interrupted. Exiting.
2018-08-22 08:25:25,205 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping NodeManager metrics system...
2018-08-22 08:25:25,206 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NodeManager metrics system stopped.
2018-08-22 08:25:25,206 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NodeManager metrics system shutdown complete.
2018-08-22 08:25:25,206 ERROR org.apache.hadoop.yarn.server.nodemanager.NodeManager: Error starting NodeManager
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Recieved SHUTDOWN signal from Resourcemanager, Registration of NodeManager failed, Message from ResourceManager: Disallowed NodeManager from test-spark-cluster-sw-kbvq.c.test-155104.internal, Sending SHUTDOWN signal to the NodeManager.
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:258)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:454)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:837)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:897)
Caused by: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Recieved SHUTDOWN signal from Resourcemanager, Registration of NodeManager failed, Message from ResourceManager: Disallowed NodeManager from test-spark-cluster-sw-kbvq.c.pvmax-155104.internal, Sending SHUTDOWN signal to the NodeManager.
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:374)
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:252)
... 6 more
2018-08-22 08:25:25,208 INFO org.apache.hadoop.yarn.server.nodemanager.NodeManager: SHUTDOWN_MSG:
/************************************************************
OP confirmed that this issue is resolved and they didn't encounter it anymore.
This happens even when the nodes are in the same subnet.
I am using the Docker-Flink project in:
https://github.com/apache/flink/tree/master/flink-contrib/docker-flink
I am creating the services with the following commands:
docker network create -d overlay overlay
docker service create --name jobmanager --env JOB_MANAGER_RPC_ADDRESS=jobmanager -p 8081:8081 --network overlay --constraint 'node.hostname == ubuntu-swarm-manager' flink jobmanager
docker service create --name taskmanager --env JOB_MANAGER_RPC_ADDRESS=jobmanager --network overlay --constraint 'node.hostname != ubuntu-swarm-manager' flink taskmanager
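For reference, a hedged sketch of the connectivity checks I can run over the overlay network; the container ID is a placeholder, and getent/nc may not be installed in the flink image:
docker service ls                  # both services should show their expected replicas
docker network inspect overlay     # both services should appear as attached containers
# from the worker node running the taskmanager container (ID is a placeholder):
docker exec -it <taskmanager-container-id> getent hosts jobmanager
docker exec -it <taskmanager-container-id> nc -zv jobmanager 6123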
This is the error I get:
- Trying to register at JobManager akka.tcp://flink#jobmanager:6123/user/jobmanager (attempt 4, timeout: 4000 milliseconds)
These are my environment configurations:
node: ubuntu-swarm-master, Azure VM Standard D4s v3 (4 vcpus, 16 GB memory), Docker version 17.03.1-ce, build c6d412e
node: azure-swarm-worker-1, Azure VM Standard D2 v2 Promo (2 vcpus, 7 GB memory), Docker version 17.09.0-ce, build afdb6d4
Flink: using image 1.3.2-hadoop2-scala_2.10
This is from the log of the container running TaskManager:
It starts up OK:
Starting Task Manager
config file:
jobmanager.rpc.address: jobmanager
jobmanager.rpc.port: 6123
jobmanager.heap.mb: 1024
taskmanager.heap.mb: 1024
taskmanager.numberOfTaskSlots: 2
taskmanager.memory.preallocate: false
parallelism.default: 1
jobmanager.web.port: 8081
blob.server.port: 6124
query.server.port: 6125
Starting taskmanager as a console application on host 00afd4130a94.
Then there are some errors:
2017-11-02 14:06:51,064 INFO org.apache.flink.runtime.util.LeaderRetrievalUtils - Trying to select the network interface and address to use by connecting to the leading JobManager.
2017-11-02 14:06:51,065 INFO org.apache.flink.runtime.util.LeaderRetrievalUtils - TaskManager will try to connect for 10000 milliseconds before falling back to heuristics
2017-11-02 14:06:51,067 INFO org.apache.flink.runtime.net.ConnectionUtils - Retrieved new target address jobmanager/10.0.0.2:6123.
2017-11-02 14:06:54,578 INFO org.apache.flink.runtime.net.ConnectionUtils - Trying to connect to address jobmanager/10.0.0.2:6123
2017-11-02 14:06:54,779 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '00afd4130a94/10.0.0.5': connect timed out
2017-11-02 14:06:54,829 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/10.0.0.5': connect timed out
2017-11-02 14:06:54,880 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/10.0.0.4': connect timed out
2017-11-02 14:06:54,931 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/172.18.0.3': connect timed out
2017-11-02 14:06:54,981 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/10.0.0.5': connect timed out
2017-11-02 14:06:55,031 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/10.0.0.4': connect timed out
2017-11-02 14:06:55,032 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/127.0.0.1': Invalid argument (connect failed)
2017-11-02 14:06:56,034 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/172.18.0.3': connect timed out
2017-11-02 14:06:57,036 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/10.0.0.5': connect timed out
2017-11-02 14:06:58,037 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/10.0.0.4': connect timed out
2017-11-02 14:06:58,038 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/127.0.0.1': Invalid argument (connect failed)
2017-11-02 14:06:58,138 INFO org.apache.flink.runtime.net.ConnectionUtils - Trying to connect to address jobmanager/10.0.0.2:6123
2017-11-02 14:06:58,339 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '00afd4130a94/10.0.0.5': connect timed out
2017-11-02 14:06:58,389 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/10.0.0.5': connect timed out
2017-11-02 14:06:58,439 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/10.0.0.4': connect timed out
2017-11-02 14:06:58,490 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/172.18.0.3': connect timed out
2017-11-02 14:06:58,541 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/10.0.0.5': connect timed out
2017-11-02 14:06:58,592 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/10.0.0.4': connect timed out
2017-11-02 14:06:58,592 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/127.0.0.1': Invalid argument (connect failed)
2017-11-02 14:06:59,593 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/172.18.0.3': connect timed out
2017-11-02 14:07:00,595 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/10.0.0.5': connect timed out
2017-11-02 14:07:01,599 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/10.0.0.4': connect timed out
2017-11-02 14:07:01,599 INFO org.apache.flink.runtime.net.ConnectionUtils - Failed to connect from address '/127.0.0.1': Invalid argument (connect failed)
2017-11-02 14:07:01,600 WARN org.apache.flink.runtime.net.ConnectionUtils - Could not connect to jobmanager/10.0.0.2:6123. Selecting a local address using heuristics.
2017-11-02 14:07:01,601 INFO org.apache.flink.runtime.taskmanager.TaskManager - TaskManager will use hostname/address '00afd4130a94' (10.0.0.5) for communication.
2017-11-02 14:07:01,601 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager
2017-11-02 14:07:01,601 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager actor system at 00afd4130a94:0.
2017-11-02 14:07:01,947 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
2017-11-02 14:07:01,978 INFO Remoting - Starting remoting
2017-11-02 14:07:02,168 INFO Remoting - Remoting started; listening on addresses :[akka.tcp://flink#00afd4130a94:33881]
2017-11-02 14:07:02,174 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager actor
2017-11-02 14:07:02,192 INFO org.apache.flink.runtime.io.network.netty.NettyConfig - NettyConfig [server address: 00afd4130a94/10.0.0.5, server port: 0, ssl enabled: false, memory segment size (bytes): 32768, transport type: NIO, number of server threads: 2 (manual), number of client threads: 2 (manual), server connect backlog: 0 (use Netty's default), client connect timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)]
2017-11-02 14:07:02,199 INFO org.apache.flink.runtime.taskexecutor.TaskManagerConfiguration - Messages have a max timeout of 10000 ms
2017-11-02 14:07:02,201 INFO org.apache.flink.runtime.taskexecutor.TaskManagerServices - Temporary file directory '/tmp': total 29 GB, usable 25 GB (86.21% usable)
2017-11-02 14:07:02,286 INFO org.apache.flink.runtime.io.network.buffer.NetworkBufferPool - Allocated 101 MB for network buffer pool (number of memory segments: 3260, bytes per segment: 32768).
2017-11-02 14:07:02,393 INFO org.apache.flink.runtime.io.network.NetworkEnvironment - Starting the network environment and its components.
2017-11-02 14:07:02,400 INFO org.apache.flink.runtime.io.network.netty.NettyClient - Successful initialization (took 2 ms).
2017-11-02 14:07:02,434 INFO org.apache.flink.runtime.io.network.netty.NettyServer - Successful initialization (took 32 ms). Listening on SocketAddress /10.0.0.5:42921.
2017-11-02 14:07:02,493 INFO org.apache.flink.runtime.taskexecutor.TaskManagerServices - Limiting managed memory to 0.7 of the currently free heap space (640 MB), memory will be allocated lazily.
2017-11-02 14:07:02,498 INFO org.apache.flink.runtime.io.disk.iomanager.IOManager - I/O manager uses directory /tmp/flink-io-e57d51fa-2269-4df0-9910-0fe26c6042bd for spill files.
2017-11-02 14:07:02,501 INFO org.apache.flink.runtime.metrics.MetricRegistry - No metrics reporter configured, no metrics will be exposed/reported.
2017-11-02 14:07:02,553 INFO org.apache.flink.runtime.filecache.FileCache - User file cache uses directory /tmp/flink-dist-cache-2c0c063f-464e-48f1-9fb8-fcfa48868e3a
2017-11-02 14:07:02,564 INFO org.apache.flink.runtime.filecache.FileCache - User file cache uses directory /tmp/flink-dist-cache-0c5e2b25-70a2-4964-9eec-24b0e79d560e
2017-11-02 14:07:02,572 INFO org.apache.flink.runtime.taskmanager.TaskManager - Starting TaskManager actor at akka://flink/user/taskmanager#1719715507.
2017-11-02 14:07:02,572 INFO org.apache.flink.runtime.taskmanager.TaskManager - TaskManager data connection information: df5992297d269fa16a5e945e1dce0451 # 00afd4130a94 (dataPort=42921)
2017-11-02 14:07:02,573 INFO org.apache.flink.runtime.taskmanager.TaskManager - TaskManager has 2 task slot(s).
2017-11-02 14:07:02,574 INFO org.apache.flink.runtime.taskmanager.TaskManager - Memory usage stats: [HEAP: 113/1024/1024 MB, NON HEAP: 33/33/-1 MB (used/committed/max)]
2017-11-02 14:07:02,576 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink#jobmanager:6123/user/jobmanager (attempt 1, timeout: 500 milliseconds)
2017-11-02 14:07:03,106 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink#jobmanager:6123/user/jobmanager (attempt 2, timeout: 1000 milliseconds)
2017-11-02 14:07:04,126 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink#jobmanager:6123/user/jobmanager (attempt 3, timeout: 2000 milliseconds)
Here is the log from the container running JobManager:
Starting Job Manager
config file:
jobmanager.rpc.address: jobmanager
jobmanager.rpc.port: 6123
jobmanager.heap.mb: 1024
taskmanager.heap.mb: 1024
taskmanager.numberOfTaskSlots: 1
taskmanager.memory.preallocate: false
parallelism.default: 1
jobmanager.web.port: 8081
blob.server.port: 6124
query.server.port: 6125
Starting jobmanager as a console application on host c30e0fe7b765.
2017-11-02 13:42:33,721 WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-11-02 13:42:33,796 INFO org.apache.flink.runtime.jobmanager.JobManager - --------------------------------------------------------------------------------
2017-11-02 13:42:33,796 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager (Version: 1.3.2, Rev:0399bee, Date:03.08.2017 # 10:23:11 UTC)
2017-11-02 13:42:33,796 INFO org.apache.flink.runtime.jobmanager.JobManager - Current user: flink
2017-11-02 13:42:33,796 INFO org.apache.flink.runtime.jobmanager.JobManager - JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.8/25.141-b15
2017-11-02 13:42:33,796 INFO org.apache.flink.runtime.jobmanager.JobManager - Maximum heap size: 981 MiBytes
2017-11-02 13:42:33,796 INFO org.apache.flink.runtime.jobmanager.JobManager - JAVA_HOME: /docker-java-home/jre
2017-11-02 13:42:33,799 INFO org.apache.flink.runtime.jobmanager.JobManager - Hadoop version: 2.7.2
2017-11-02 13:42:33,800 INFO org.apache.flink.runtime.jobmanager.JobManager - JVM Options:
2017-11-02 13:42:33,800 INFO org.apache.flink.runtime.jobmanager.JobManager - -Xms1024m
2017-11-02 13:42:33,800 INFO org.apache.flink.runtime.jobmanager.JobManager - -Xmx1024m
2017-11-02 13:42:33,800 INFO org.apache.flink.runtime.jobmanager.JobManager - -Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties
2017-11-02 13:42:33,800 INFO org.apache.flink.runtime.jobmanager.JobManager - -Dlogback.configurationFile=file:/opt/flink/conf/logback-console.xml
2017-11-02 13:42:33,800 INFO org.apache.flink.runtime.jobmanager.JobManager - Program Arguments:
2017-11-02 13:42:33,800 INFO org.apache.flink.runtime.jobmanager.JobManager - --configDir
2017-11-02 13:42:33,800 INFO org.apache.flink.runtime.jobmanager.JobManager - /opt/flink/conf
2017-11-02 13:42:33,800 INFO org.apache.flink.runtime.jobmanager.JobManager - --executionMode
2017-11-02 13:42:33,800 INFO org.apache.flink.runtime.jobmanager.JobManager - cluster
2017-11-02 13:42:33,800 INFO org.apache.flink.runtime.jobmanager.JobManager - Classpath: /opt/flink/lib/flink-python_2.11-1.3.2.jar:/opt/flink/lib/flink-shaded-hadoop2-uber-1.3.2.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.7.jar:/opt/flink/lib/flink-dist_2.11-1.3.2.jar:::
2017-11-02 13:42:33,801 INFO org.apache.flink.runtime.jobmanager.JobManager - --------------------------------------------------------------------------------
2017-11-02 13:42:33,801 INFO org.apache.flink.runtime.jobmanager.JobManager - Registered UNIX signal handlers for [TERM, HUP, INT]
2017-11-02 13:42:33,911 INFO org.apache.flink.runtime.jobmanager.JobManager - Loading configuration from /opt/flink/conf
2017-11-02 13:42:33,914 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.address, jobmanager
2017-11-02 13:42:33,915 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123
2017-11-02 13:42:33,915 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.mb, 1024
2017-11-02 13:42:33,915 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.heap.mb, 1024
2017-11-02 13:42:33,915 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.numberOfTaskSlots, 1
2017-11-02 13:42:33,915 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.memory.preallocate, false
2017-11-02 13:42:33,916 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: parallelism.default, 1
2017-11-02 13:42:33,916 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.web.port, 8081
2017-11-02 13:42:33,917 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: blob.server.port, 6124
2017-11-02 13:42:33,917 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: query.server.port, 6125
2017-11-02 13:42:33,924 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager without high-availability
2017-11-02 13:42:33,926 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager on jobmanager:6123 with execution mode CLUSTER
2017-11-02 13:42:33,934 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.address, jobmanager
2017-11-02 13:42:33,934 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123
2017-11-02 13:42:33,934 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.mb, 1024
2017-11-02 13:42:33,934 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.heap.mb, 1024
2017-11-02 13:42:33,935 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.numberOfTaskSlots, 1
2017-11-02 13:42:33,935 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.memory.preallocate, false
2017-11-02 13:42:33,935 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: parallelism.default, 1
2017-11-02 13:42:33,935 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.web.port, 8081
2017-11-02 13:42:33,936 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: blob.server.port, 6124
2017-11-02 13:42:33,936 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: query.server.port, 6125
2017-11-02 13:42:33,962 INFO org.apache.flink.runtime.security.modules.HadoopModule - Hadoop user set to flink (auth:SIMPLE)
2017-11-02 13:42:34,026 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager actor system reachable at jobmanager:6123
2017-11-02 13:42:34,290 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
2017-11-02 13:42:34,327 INFO Remoting - Starting remoting
2017-11-02 13:42:34,505 INFO Remoting - Remoting started; listening on addresses :[akka.tcp://flink#jobmanager:6123]
2017-11-02 13:42:34,524 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager web frontend
2017-11-02 13:42:34,532 WARN org.apache.flink.runtime.webmonitor.WebMonitorUtils - Log file environment variable 'log.file' is not set.
2017-11-02 13:42:34,532 WARN org.apache.flink.runtime.webmonitor.WebMonitorUtils - JobManager log files are unavailable in the web dashboard. Log file location not found in environment variable 'log.file' or configuration key 'jobmanager.web.log.path'.
2017-11-02 13:42:34,532 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Using directory /tmp/flink-web-9f0ba581-3488-4086-a79c-53e17b56352c for the web interface files
2017-11-02 13:42:34,533 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Using directory /tmp/flink-web-17a58ccf-7d8b-475e-b727-4a7935a19c0f for web frontend JAR file uploads
2017-11-02 13:42:34,741 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Web frontend listening at 0:0:0:0:0:0:0:0:8081
2017-11-02 13:42:34,741 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager actor
2017-11-02 13:42:34,751 INFO org.apache.flink.runtime.blob.BlobServer - Created BLOB server storage directory /tmp/blobStore-d10b620a-73ae-40af-bd23-aad5211fe1cc
2017-11-02 13:42:34,752 INFO org.apache.flink.runtime.blob.BlobServer - Started BLOB server at 0.0.0.0:6124 - max concurrent requests: 50 - max backlog: 1000
2017-11-02 13:42:34,763 INFO org.apache.flink.runtime.metrics.MetricRegistry - No metrics reporter configured, no metrics will be exposed/reported.
2017-11-02 13:42:34,769 INFO org.apache.flink.runtime.jobmanager.MemoryArchivist - Started memory archivist akka://flink/user/archive
2017-11-02 13:42:34,774 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Starting with JobManager akka.tcp://flink#jobmanager:6123/user/jobmanager on port 8081
2017-11-02 13:42:34,774 INFO org.apache.flink.runtime.webmonitor.JobManagerRetriever - New leader reachable under akka.tcp://flink#jobmanager:6123/user/jobmanager:00000000-0000-0000-0000-000000000000.
2017-11-02 13:42:34,776 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager at akka.tcp://flink#jobmanager:6123/user/jobmanager.
2017-11-02 13:42:34,785 INFO org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager - Trying to associate with JobManager leader akka.tcp://flink#jobmanager:6123/user/jobmanager
2017-11-02 13:42:34,801 INFO org.apache.flink.runtime.jobmanager.JobManager - JobManager akka.tcp://flink#jobmanager:6123/user/jobmanager was granted leadership with leader session ID Some(00000000-0000-0000-0000-000000000000).
2017-11-02 13:42:34,814 INFO org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager - Resource Manager associating with leading JobManager Actor[akka://flink/user/jobmanager#844712453] - leader session 00000000-0000-0000-0000-000000000000
Why can't the TaskManagers talk to the JobManager? I wonder if there's some configuration missing. Any help would be much appreciated. Thank you very much!
I am trying GraphX with the LiveJournal data: https://snap.stanford.edu/data/soc-LiveJournal1.html.
I have a cluster of 10 compute nodes. Each node has 64 GB of RAM and 32 cores.
When I run the PageRank algorithm using 9 worker nodes, it's slower than running it with just 1 worker node. I suspect I am not utilizing all the memory and/or cores due to some configuration issue.
I went through the configuration, tuning, and programming guides for Spark.
I am using spark-shell to run the script, which I invoke with:
./spark-shell --executor-memory 50g
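For comparison, a hedged sketch of launching the same shell with the master URL and a total core budget spelled out explicitly; the numbers are illustrative (9 workers x 32 cores) and this is not a claimed fix:
./spark-shell \
  --master spark://node0472.local:7077 \
  --executor-memory 50g \
  --total-executor-cores 288   # illustrative: 9 workers x 32 cores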
I have the workers and master running. When I start spark-shell, I get the following logs:
14/07/09 17:26:10 INFO Slf4jLogger: Slf4jLogger started
14/07/09 17:26:10 INFO Remoting: Starting remoting
14/07/09 17:26:10 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark#node0472.local:60035]
14/07/09 17:26:10 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark#node0472.local:60035]
14/07/09 17:26:10 INFO SparkEnv: Registering MapOutputTracker
14/07/09 17:26:10 INFO SparkEnv: Registering BlockManagerMaster
14/07/09 17:26:10 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20140709172610-7f5e
14/07/09 17:26:10 INFO MemoryStore: MemoryStore started with capacity 294.4 MB.
14/07/09 17:26:10 INFO ConnectionManager: Bound socket to port 45700 with id = ConnectionManagerId(node0472.local,45700)
14/07/09 17:26:10 INFO BlockManagerMaster: Trying to register BlockManager
14/07/09 17:26:10 INFO BlockManagerInfo: Registering block manager node0472.local:45700 with 294.4 MB RAM
14/07/09 17:26:10 INFO BlockManagerMaster: Registered BlockManager
14/07/09 17:26:10 INFO HttpServer: Starting HTTP Server
14/07/09 17:26:10 INFO HttpBroadcast: Broadcast server started at http://172.16.104.72:48116
14/07/09 17:26:10 INFO HttpFileServer: HTTP File server directory is /tmp/spark-7b4a7c3c-9fc9-4a64-b2ac-5f328abe9265
14/07/09 17:26:10 INFO HttpServer: Starting HTTP Server
14/07/09 17:26:11 INFO SparkUI: Started SparkUI at http://node0472.local:4040
14/07/09 17:26:12 INFO AppClient$ClientActor: Connecting to master spark://node0472.local:7077...
14/07/09 17:26:12 INFO SparkILoop: Created spark context..
14/07/09 17:26:12 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20140709172612-0007
14/07/09 17:26:12 INFO AppClient$ClientActor: Executor added: app-20140709172612-0007/0 on worker-20140709162149-node0476.local-53728 (node0476.local:53728) with 32 cores
14/07/09 17:26:12 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140709172612-0007/0 on hostPort node0476.local:53728 with 32 cores, 50.0 GB RAM
14/07/09 17:26:12 INFO AppClient$ClientActor: Executor added: app-20140709172612-0007/1 on worker-20140709162145-node0475.local-56009 (node0475.local:56009) with 32 cores
14/07/09 17:26:12 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140709172612-0007/1 on hostPort node0475.local:56009 with 32 cores, 50.0 GB RAM
14/07/09 17:26:12 INFO AppClient$ClientActor: Executor added: app-20140709172612-0007/2 on worker-20140709162141-node0474.local-58108 (node0474.local:58108) with 32 cores
14/07/09 17:26:12 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140709172612-0007/2 on hostPort node0474.local:58108 with 32 cores, 50.0 GB RAM
14/07/09 17:26:12 INFO AppClient$ClientActor: Executor added: app-20140709172612-0007/3 on worker-20140709170011-node0480.local-49021 (node0480.local:49021) with 32 cores
14/07/09 17:26:12 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140709172612-0007/3 on hostPort node0480.local:49021 with 32 cores, 50.0 GB RAM
14/07/09 17:26:12 INFO AppClient$ClientActor: Executor added: app-20140709172612-0007/4 on worker-20140709165929-node0479.local-53886 (node0479.local:53886) with 32 cores
14/07/09 17:26:12 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140709172612-0007/4 on hostPort node0479.local:53886 with 32 cores, 50.0 GB RAM
14/07/09 17:26:12 INFO AppClient$ClientActor: Executor added: app-20140709172612-0007/5 on worker-20140709170036-node0481.local-60958 (node0481.local:60958) with 32 cores
14/07/09 17:26:12 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140709172612-0007/5 on hostPort node0481.local:60958 with 32 cores, 50.0 GB RAM
14/07/09 17:26:12 INFO AppClient$ClientActor: Executor added: app-20140709172612-0007/6 on worker-20140709162151-node0477.local-44550 (node0477.local:44550) with 32 cores
14/07/09 17:26:12 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140709172612-0007/6 on hostPort node0477.local:44550 with 32 cores, 50.0 GB RAM
14/07/09 17:26:12 INFO AppClient$ClientActor: Executor added: app-20140709172612-0007/7 on worker-20140709162138-node0473.local-42025 (node0473.local:42025) with 32 cores
14/07/09 17:26:12 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140709172612-0007/7 on hostPort node0473.local:42025 with 32 cores, 50.0 GB RAM
14/07/09 17:26:12 INFO AppClient$ClientActor: Executor added: app-20140709172612-0007/8 on worker-20140709162156-node0478.local-52943 (node0478.local:52943) with 32 cores
14/07/09 17:26:12 INFO SparkDeploySchedulerBackend: Granted executor ID app-20140709172612-0007/8 on hostPort node0478.local:52943 with 32 cores, 50.0 GB RAM
14/07/09 17:26:12 INFO AppClient$ClientActor: Executor updated: app-20140709172612-0007/1 is now RUNNING
14/07/09 17:26:12 INFO AppClient$ClientActor: Executor updated: app-20140709172612-0007/0 is now RUNNING
14/07/09 17:26:12 INFO AppClient$ClientActor: Executor updated: app-20140709172612-0007/2 is now RUNNING
14/07/09 17:26:12 INFO AppClient$ClientActor: Executor updated: app-20140709172612-0007/3 is now RUNNING
14/07/09 17:26:12 INFO AppClient$ClientActor: Executor updated: app-20140709172612-0007/6 is now RUNNING
14/07/09 17:26:12 INFO AppClient$ClientActor: Executor updated: app-20140709172612-0007/4 is now RUNNING
14/07/09 17:26:12 INFO AppClient$ClientActor: Executor updated: app-20140709172612-0007/5 is now RUNNING
14/07/09 17:26:12 INFO AppClient$ClientActor: Executor updated: app-20140709172612-0007/8 is now RUNNING
14/07/09 17:26:12 INFO AppClient$ClientActor: Executor updated: app-20140709172612-0007/7 is now RUNNING
Spark context available as sc.
scala> 14/07/09 17:26:18 INFO SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@node0479.local:47343/user/Executor#1253632521] with ID 4
14/07/09 17:26:18 INFO SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@node0474.local:39431/user/Executor#1607018658] with ID 2
14/07/09 17:26:18 INFO SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@node0481.local:53722/user/Executor#-1846270627] with ID 5
14/07/09 17:26:18 INFO SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@node0477.local:40185/user/Executor#-111495591] with ID 6
14/07/09 17:26:18 INFO SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@node0473.local:36426/user/Executor#652192289] with ID 7
14/07/09 17:26:18 INFO SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@node0480.local:37230/user/Executor#-1581927012] with ID 3
14/07/09 17:26:18 INFO SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@node0475.local:46363/user/Executor#-182973444] with ID 1
14/07/09 17:26:18 INFO SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@node0476.local:58053/user/Executor#609775393] with ID 0
14/07/09 17:26:18 INFO SparkDeploySchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@node0478.local:55152/user/Executor#-2126598605] with ID 8
14/07/09 17:26:19 INFO BlockManagerInfo: Registering block manager node0474.local:60025 with 28.8 GB RAM
14/07/09 17:26:19 INFO BlockManagerInfo: Registering block manager node0473.local:33992 with 28.8 GB RAM
14/07/09 17:26:19 INFO BlockManagerInfo: Registering block manager node0481.local:46513 with 28.8 GB RAM
14/07/09 17:26:19 INFO BlockManagerInfo: Registering block manager node0477.local:37455 with 28.8 GB RAM
14/07/09 17:26:19 INFO BlockManagerInfo: Registering block manager node0475.local:33829 with 28.8 GB RAM
14/07/09 17:26:19 INFO BlockManagerInfo: Registering block manager node0479.local:56433 with 28.8 GB RAM
14/07/09 17:26:19 INFO BlockManagerInfo: Registering block manager node0480.local:38134 with 28.8 GB RAM
14/07/09 17:26:19 INFO BlockManagerInfo: Registering block manager node0476.local:46284 with 28.8 GB RAM
14/07/09 17:26:19 INFO BlockManagerInfo: Registering block manager node0478.local:43187 with 28.8 GB RAM
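As a quick sanity check (this snippet is my own addition, not part of the original run), the storage memory registered per executor can be printed straight from the shell using SparkContext.getExecutorMemoryStatus; it is just a sketch to confirm that all 9 workers actually showed up:

// Prints, for each executor's block manager, its maximum and remaining
// storage memory in megabytes.
sc.getExecutorMemoryStatus.foreach { case (executor, (maxMem, remainingMem)) =>
  println(s"$executor -> max: ${maxMem / (1024 * 1024)} MB, free: ${remainingMem / (1024 * 1024)} MB")
}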
According to the logs, I believe my application was registered on the workers and each executor had 50 GB of RAM. Now I run the following Scala code in my terminal to load the data and compute PageRank:
import org.apache.spark._
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Time the graph loading step.
val startgraphloading = System.currentTimeMillis
val graph = GraphLoader.edgeListFile(sc, "filepath").cache()
graph.cache()
val endgraphloading = System.currentTimeMillis

// One PageRank iteration.
val startpr1 = System.currentTimeMillis
val prGraph1 = graph.staticPageRank(1)
val endpr1 = System.currentTimeMillis

// Five PageRank iterations.
val startpr2 = System.currentTimeMillis
val prGraph2 = graph.staticPageRank(5)
val endpr2 = System.currentTimeMillis

// Elapsed times in milliseconds.
val loadingt = endgraphloading - startgraphloading
val firstt = endpr1 - startpr1
val secondt = endpr2 - startpr2

println(loadingt)
println(firstt)
println(secondt)
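As a side note (this check is not in the original script), the number of partitions the loader actually produced can be inspected directly; with the default settings it can end up much smaller than the number of cores in the cluster:

// How many partitions the edge and vertex RDDs ended up with. If these
// numbers are small, most executors will simply have nothing to work on.
println(graph.edges.partitions.length)
println(graph.vertices.partitions.length)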
When I check memory usage on each node, only the RAM of 2-3 computing nodes is actually being used. Is that expected? It runs faster with only 1 worker than with 9 workers.
I am using Spark standalone cluster mode. Is there any issue with my configuration?
Thanks in advance :)
I figured out the problem after looking at the Spark code. It was an issue in my script where I use GraphX.
val graph = GraphLoader.edgeListFile(sc, "filepath").cache();
When I looked at the signature of edgeListFile, it defaults to minEdgePartitions = 1. I had thought of it as just a minimum, but in practice it determines how many partitions the edge list is loaded into, so I set it to the number of partitions I actually want (one per node) and that solved it. Another thing to take care of, as mentioned in the GraphX programming guide: if you haven't built Spark 1.0 from the main branch, you should call partitionBy yourself with a suitable PartitionStrategy. If the graph is not partitioned properly, it will cause problems.
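A minimal sketch of what the fixed loading step could look like (the partition count and the EdgePartition2D strategy here are my assumptions; adjust them for your own cluster):

import org.apache.spark.graphx._

// Spread the edge list across many partitions up front (here roughly one
// per core across 9 workers x 32 cores), then repartition the graph with an
// explicit PartitionStrategy before running PageRank.
val numPartitions = 9 * 32
val graph = GraphLoader
  .edgeListFile(sc, "filepath", canonicalOrientation = false, minEdgePartitions = numPartitions)
  .partitionBy(PartitionStrategy.EdgePartition2D)
  .cache()

val prGraph = graph.staticPageRank(5)

With the edges spread out like this, all the workers get a share of the graph instead of 2-3 of them doing all the work.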
It took me a while to figure this out. I hope this info saves someone's time :)