Spark executor lost when increasing the number of executor instances - apache-spark

My Hadoop cluster currently has 4 nodes and 45 cores, running PySpark 2.4 through YARN. When I run spark-submit with one executor everything works fine, but if I change the number of executor instances to 3 or 4, the extra executors are killed by the driver and only one task keeps running.
I have changed the settings below in Cloudera Manager:
yarn.nodemanager.resource.memory-mb: 64 GB
yarn.nodemanager.resource.cpu-vcores: 45
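For reference, a minimal PySpark sketch of the kind of submission being described, expressed through the SparkSession builder rather than spark-submit flags (the app name and memory value are illustrative; only the executor count of 3 comes from the question):

from pyspark.sql import SparkSession

# Illustrative sketch only: request a fixed number of executors on YARN.
# "example-app" and the 8g memory figure are hypothetical; spark.executor.instances = 3
# mirrors the case described in the question.
spark = (
    SparkSession.builder
    .appName("example-app")
    .master("yarn")
    .config("spark.executor.instances", "3")
    .config("spark.executor.memory", "8g")
    .getOrCreate()
)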
And below is the log that I get:
19/03/21 11:28:48 INFO cluster.YarnScheduler: Adding task set 0.0 with 1 tasks
19/03/21 11:28:48 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, datanode1, executor 2, partition 0, PROCESS_LOCAL, 7701 bytes)
19/03/21 11:28:48 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on datanode1:42432 (size: 71.0 KB, free: 366.2 MB)
19/03/21 11:29:43 INFO spark.ExecutorAllocationManager: Request to remove executorIds: 1, 3
19/03/21 11:29:43 INFO cluster.YarnClientSchedulerBackend: Requesting to kill executor(s) 1, 3
19/03/21 11:29:43 INFO cluster.YarnClientSchedulerBackend: Actual list of executor(s) to be killed is 1, 3
19/03/21 11:29:43 INFO spark.ExecutorAllocationManager: Removing executor 1 because it has been idle for 60 seconds (new desired total will be 2)
19/03/21 11:29:43 INFO spark.ExecutorAllocationManager: Removing executor 3 because it has been idle for 60 seconds (new desired total will be 1)
19/03/21 11:29:45 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 3.
19/03/21 11:29:45 INFO scheduler.DAGScheduler: Executor lost: 3 (epoch 0)
19/03/21 11:29:45 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 3 from BlockManagerMaster.
19/03/21 11:29:45 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(3, datanode2, 32853, None)
19/03/21 11:29:45 INFO storage.BlockManagerMaster: Removed 3 successfully in removeExecutor
19/03/21 11:29:45 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 1.
19/03/21 11:29:45 INFO scheduler.DAGScheduler: Executor lost: 1 (epoch 0)
19/03/21 11:29:45 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
19/03/21 11:29:45 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(1, datanode3, 39466, None)
19/03/21 11:29:45 INFO storage.BlockManagerMaster: Removed 1 successfully in removeExecutor
19/03/21 11:29:45 INFO cluster.YarnScheduler: Executor 3 on datanode2 killed by driver.
19/03/21 11:29:45 INFO cluster.YarnScheduler: Executor 1 on datanode3 killed by driver.
19/03/21 11:29:45 INFO spark.ExecutorAllocationManager: Existing executor 3 has been removed (new total is 2)
19/03/21 11:29:45 INFO spark.ExecutorAllocationManager: Existing executor 1 has been removed (new total is 1)

Related

Can the driver program and the cluster manager (resource manager) be on the same machine in Spark standalone?

I'm doing spark-submit from the same machine as the Spark master, using the following command:
./bin/spark-submit --master spark://ip:port --deploy-mode "client" test.py
My application runs forever with the following kind of output:
22/11/18 13:17:37 INFO BlockManagerMaster: Removal of executor 8 requested
22/11/18 13:17:37 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 8
22/11/18 13:17:37 INFO StandaloneSchedulerBackend: Granted executor ID app-20221118131723-0008/10 on hostPort 192.168.210.94:37443 with 2 core(s), 1024.0 MiB RAM
22/11/18 13:17:37 INFO BlockManagerMasterEndpoint: Trying to remove executor 8 from BlockManagerMaster.
22/11/18 13:17:37 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20221118131723-0008/10 is now RUNNING
22/11/18 13:17:38 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20221118131723-0008/9 is now EXITED (Command exited with code 1)
22/11/18 13:17:38 INFO StandaloneSchedulerBackend: Executor app-20221118131723-0008/9 removed: Command exited with code 1
22/11/18 13:17:38 INFO BlockManagerMaster: Removal of executor 9 requested
22/11/18 13:17:38 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 9
22/11/18 13:17:38 INFO BlockManagerMasterEndpoint: Trying to remove executor 9 from BlockManagerMaster.
22/11/18 13:17:38 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20221118131723-0008/11 on worker-20221118111836-192.168.210.82-46395 (192.168.210.82:4639
But when I run it from other nodes, my application runs successfully. What could be the reason?

Spark removing executors in DataProc clusters

During Spark job execution I constantly face the following issue: even when the DataProc cluster is idle, the Spark driver kills all the executors and I end up with only one executor.
18/02/26 08:47:05 INFO spark.ExecutorAllocationManager: Existing executor 35 has been removed (new total is 2)
18/02/26 08:50:40 INFO scheduler.TaskSetManager: Finished task 189.0 in stage 5.0 (TID 6002) in 569184 ms on dse-dev-dataproc-w-53.c.k-ddh-lle.internal (executor 57) (499/500)
18/02/26 08:51:40 INFO spark.ExecutorAllocationManager: Request to remove executorIds: 57
18/02/26 08:51:40 INFO cluster.YarnClientSchedulerBackend: Requesting to kill executor(s) 57
18/02/26 08:51:40 INFO cluster.YarnClientSchedulerBackend: Actual list of executor(s) to be killed is 57
18/02/26 08:51:40 INFO spark.ExecutorAllocationManager: Removing executor 57 because it has been idle for 60 seconds (new desired total will be 1)
18/02/26 08:51:42 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 57.
18/02/26 08:51:42 INFO scheduler.DAGScheduler: Executor lost: 57 (epoch 7)
18/02/26 08:51:42 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 57 from BlockManagerMaster.
18/02/26 08:51:42 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(57, dse-dev-dataproc-w-53.c.k-ddh-lle.internal, 53072, None)
18/02/26 08:51:42 INFO storage.BlockManagerMaster: Removed 57 successfully in removeExecutor
18/02/26 08:51:42 INFO cluster.YarnScheduler: Executor 57 on dse-dev-dataproc-w-53.c.k-ddh-lle.internal killed by driver.
18/02/26 08:51:42 INFO spark.ExecutorAllocationManager: Existing executor 57 has been removed (new total is 1)
Once the above task is completed and the next stage starts, it looks for the shuffled data.
18/02/26 08:52:33 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 5 to 10.206.52.190:42676
18/02/26 08:52:33 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 7 to 10.206.52.157:45812
18/02/26 08:52:33 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 7 to 10.206.52.177:53612
18/02/26 08:52:33 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 5 to 10.206.52.166:41901
This hangs for some time and then the tasks fail with the following exception.
18/02/26 09:12:33 INFO BlockManagerMaster: Removal of executor 21 requested
18/02/26 09:12:33 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asked to remove non-existent executor 21
18/02/26 00:12:33 INFO BlockManagerMasterEndpoint: Trying to remove executor 21 from BlockManagerMaster.
18/02/26 09:12:33 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1480517110174_0001_01_000049 on host: ip-10-138-114-125.ec2.internal. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_1480517110174_0001_01_000049
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
My Job executor settings:
spark.driver.maxResultSize 3840m
spark.driver.memory 16G
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.maxExecutors 100
spark.dynamicAllocation.minExecutors 1
spark.executor.cores 6
spark.executor.id driver
spark.executor.instances 20
spark.executor.memory 30G
spark.hadoop.yarn.timeline-service.enabled False
spark.shuffle.service.enabled true
spark.sql.catalogImplementation hive
spark.sql.parquet.cacheMetadata false
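For reference, the 60-second idle removal seen in the log above is governed by spark.dynamicAllocation.executorIdleTimeout, which defaults to 60s. A minimal sketch of raising it alongside settings like those listed; the 10-minute value is purely illustrative, not a recommendation:

from pyspark import SparkConf, SparkContext

# Sketch only: keep idle executors around longer before dynamic allocation reclaims them.
# The 600s timeout is an example value.
conf = (
    SparkConf()
    .set("spark.dynamicAllocation.enabled", "true")
    .set("spark.shuffle.service.enabled", "true")
    .set("spark.dynamicAllocation.executorIdleTimeout", "600s")
)
sc = SparkContext(conf=conf)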

Why would Spark executors be removed (with "ExecutorAllocationManager: Request to remove executorIds" in the logs)?

I'm trying to execute a Spark job on an AWS cluster of 6 c4.2xlarge nodes and I don't know why Spark is killing the executors...
Any help will be appreciated
Here is the spark-submit command:
. /usr/bin/spark-submit --packages="com.databricks:spark-avro_2.11:3.2.0" --jars RedshiftJDBC42-1.2.1.1001.jar --deploy-mode client --master yarn --num-executors 12 --executor-cores 3 --executor-memory 7G --driver-memory 7g --py-files dependencies.zip iface_extractions.py 2016-10-01 > output.log
At this line it starts to remove executors:
17/05/25 14:42:50 INFO ExecutorAllocationManager: Request to remove executorIds: 5, 3
Output spark-submit log:
Ivy Default Cache set to: /home/hadoop/.ivy2/cache
The jars for the packages stored in: /home/hadoop/.ivy2/jars
:: loading settings :: url = jar:file:/usr/lib/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-avro_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found com.databricks#spark-avro_2.11;3.2.0 in central
found org.slf4j#slf4j-api;1.7.5 in central
found org.apache.avro#avro;1.7.6 in central
found org.codehaus.jackson#jackson-core-asl;1.9.13 in central
found org.codehaus.jackson#jackson-mapper-asl;1.9.13 in central
found com.thoughtworks.paranamer#paranamer;2.3 in central
found org.xerial.snappy#snappy-java;1.0.5 in central
found org.apache.commons#commons-compress;1.4.1 in central
found org.tukaani#xz;1.0 in central
:: resolution report :: resolve 284ms :: artifacts dl 8ms
:: modules in use:
com.databricks#spark-avro_2.11;3.2.0 from central in [default]
com.thoughtworks.paranamer#paranamer;2.3 from central in [default]
org.apache.avro#avro;1.7.6 from central in [default]
org.apache.commons#commons-compress;1.4.1 from central in [default]
org.codehaus.jackson#jackson-core-asl;1.9.13 from central in [default]
org.codehaus.jackson#jackson-mapper-asl;1.9.13 from central in [default]
org.slf4j#slf4j-api;1.7.5 from central in [default]
org.tukaani#xz;1.0 from central in [default]
org.xerial.snappy#snappy-java;1.0.5 from central in [default]
:: evicted modules:
org.slf4j#slf4j-api;1.6.4 by [org.slf4j#slf4j-api;1.7.5] in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 10 | 0 | 0 | 1 || 9 | 0 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
confs: [default]
0 artifacts copied, 9 already retrieved (0kB/8ms)
17/05/25 14:41:37 INFO SparkContext: Running Spark version 2.1.0
17/05/25 14:41:38 INFO SecurityManager: Changing view acls to: hadoop
17/05/25 14:41:38 INFO SecurityManager: Changing modify acls to: hadoop
17/05/25 14:41:38 INFO SecurityManager: Changing view acls groups to:
17/05/25 14:41:38 INFO SecurityManager: Changing modify acls groups to:
17/05/25 14:41:38 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); groups with view permissions: Set(); users with modify permissions: Set(hadoop); groups with modify permissions: Set()
17/05/25 14:41:38 INFO Utils: Successfully started service 'sparkDriver' on port 37132.
17/05/25 14:41:38 INFO SparkEnv: Registering MapOutputTracker
17/05/25 14:41:38 INFO SparkEnv: Registering BlockManagerMaster
17/05/25 14:41:38 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/05/25 14:41:38 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/05/25 14:41:38 INFO DiskBlockManager: Created local directory at /mnt/tmp/blockmgr-e368a261-c1a1-49e7-8533-8081896a45e4
17/05/25 14:41:38 INFO MemoryStore: MemoryStore started with capacity 4.0 GB
17/05/25 14:41:38 INFO SparkEnv: Registering OutputCommitCoordinator
17/05/25 14:41:39 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/05/25 14:41:39 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.185.53.161:4040
17/05/25 14:41:39 INFO Utils: Using initial executors = 12, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
17/05/25 14:41:39 INFO RMProxy: Connecting to ResourceManager at ip-10-185-53-161.eu-west-1.compute.internal/10.185.53.161:8032
17/05/25 14:41:39 INFO Client: Requesting a new application from cluster with 5 NodeManagers
17/05/25 14:41:40 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (11520 MB per container)
17/05/25 14:41:40 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
17/05/25 14:41:40 INFO Client: Setting up container launch context for our AM
17/05/25 14:41:40 INFO Client: Setting up the launch environment for our AM container
17/05/25 14:41:40 INFO Client: Preparing resources for our AM container
17/05/25 14:41:40 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
17/05/25 14:41:42 INFO Client: Uploading resource file:/mnt/tmp/spark-4f534fa1-c377-4113-9c86-96d5cdab4cb5/__spark_libs__6500399427935716229.zip -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/__spark_libs__6500399427935716229.zip
17/05/25 14:41:43 INFO Client: Uploading resource file:/home/hadoop/RedshiftJDBC42-1.2.1.1001.jar -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/RedshiftJDBC42-1.2.1.1001.jar
17/05/25 14:41:43 INFO Client: Uploading resource file:/home/hadoop/.ivy2/jars/com.databricks_spark-avro_2.11-3.2.0.jar -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/com.databricks_spark-avro_2.11-3.2.0.jar
17/05/25 14:41:43 INFO Client: Uploading resource file:/home/hadoop/.ivy2/jars/org.slf4j_slf4j-api-1.7.5.jar -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/org.slf4j_slf4j-api-1.7.5.jar
17/05/25 14:41:43 INFO Client: Uploading resource file:/home/hadoop/.ivy2/jars/org.apache.avro_avro-1.7.6.jar -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/org.apache.avro_avro-1.7.6.jar
17/05/25 14:41:43 INFO Client: Uploading resource file:/home/hadoop/.ivy2/jars/org.codehaus.jackson_jackson-core-asl-1.9.13.jar -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/org.codehaus.jackson_jackson-core-asl-1.9.13.jar
17/05/25 14:41:43 INFO Client: Uploading resource file:/home/hadoop/.ivy2/jars/org.codehaus.jackson_jackson-mapper-asl-1.9.13.jar -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/org.codehaus.jackson_jackson-mapper-asl-1.9.13.jar
17/05/25 14:41:43 INFO Client: Uploading resource file:/home/hadoop/.ivy2/jars/com.thoughtworks.paranamer_paranamer-2.3.jar -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/com.thoughtworks.paranamer_paranamer-2.3.jar
17/05/25 14:41:43 INFO Client: Uploading resource file:/home/hadoop/.ivy2/jars/org.xerial.snappy_snappy-java-1.0.5.jar -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/org.xerial.snappy_snappy-java-1.0.5.jar
17/05/25 14:41:43 INFO Client: Uploading resource file:/home/hadoop/.ivy2/jars/org.apache.commons_commons-compress-1.4.1.jar -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/org.apache.commons_commons-compress-1.4.1.jar
17/05/25 14:41:43 INFO Client: Uploading resource file:/home/hadoop/.ivy2/jars/org.tukaani_xz-1.0.jar -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/org.tukaani_xz-1.0.jar
17/05/25 14:41:43 INFO Client: Uploading resource file:/etc/spark/conf/hive-site.xml -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/hive-site.xml
17/05/25 14:41:43 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/pyspark.zip
17/05/25 14:41:43 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/py4j-0.10.4-src.zip -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/py4j-0.10.4-src.zip
17/05/25 14:41:43 INFO Client: Uploading resource file:/home/hadoop/dependencies.zip -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/dependencies.zip
17/05/25 14:41:43 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/com.databricks_spark-avro_2.11-3.2.0.jar added multiple times to distributed cache.
17/05/25 14:41:43 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.slf4j_slf4j-api-1.7.5.jar added multiple times to distributed cache.
17/05/25 14:41:43 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.apache.avro_avro-1.7.6.jar added multiple times to distributed cache.
17/05/25 14:41:43 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.codehaus.jackson_jackson-core-asl-1.9.13.jar added multiple times to distributed cache.
17/05/25 14:41:43 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.codehaus.jackson_jackson-mapper-asl-1.9.13.jar added multiple times to distributed cache.
17/05/25 14:41:43 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/com.thoughtworks.paranamer_paranamer-2.3.jar added multiple times to distributed cache.
17/05/25 14:41:43 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.xerial.snappy_snappy-java-1.0.5.jar added multiple times to distributed cache.
17/05/25 14:41:43 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.apache.commons_commons-compress-1.4.1.jar added multiple times to distributed cache.
17/05/25 14:41:43 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.tukaani_xz-1.0.jar added multiple times to distributed cache.
17/05/25 14:41:43 INFO Client: Uploading resource file:/mnt/tmp/spark-4f534fa1-c377-4113-9c86-96d5cdab4cb5/__spark_conf__1516567354161750682.zip -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/__spark_conf__.zip
17/05/25 14:41:43 INFO SecurityManager: Changing view acls to: hadoop
17/05/25 14:41:43 INFO SecurityManager: Changing modify acls to: hadoop
17/05/25 14:41:43 INFO SecurityManager: Changing view acls groups to:
17/05/25 14:41:43 INFO SecurityManager: Changing modify acls groups to:
17/05/25 14:41:43 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); groups with view permissions: Set(); users with modify permissions: Set(hadoop); groups with modify permissions: Set()
17/05/25 14:41:43 INFO Client: Submitting application application_1495720658394_0004 to ResourceManager
17/05/25 14:41:43 INFO YarnClientImpl: Submitted application application_1495720658394_0004
17/05/25 14:41:43 INFO SchedulerExtensionServices: Starting Yarn extension services with app application_1495720658394_0004 and attemptId None
17/05/25 14:41:44 INFO Client: Application report for application_1495720658394_0004 (state: ACCEPTED)
17/05/25 14:41:44 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1495723303463
final status: UNDEFINED
tracking URL: http://ip-10-185-53-161.eu-west-1.compute.internal:20888/proxy/application_1495720658394_0004/
user: hadoop
17/05/25 14:41:45 INFO Client: Application report for application_1495720658394_0004 (state: ACCEPTED)
17/05/25 14:41:46 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as NettyRpcEndpointRef(null)
17/05/25 14:41:46 INFO Client: Application report for application_1495720658394_0004 (state: ACCEPTED)
17/05/25 14:41:46 INFO YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> ip-10-185-53-161.eu-west-1.compute.internal, PROXY_URI_BASES -> http://ip-10-185-53-161.eu-west-1.compute.internal:20888/proxy/application_1495720658394_0004), /proxy/application_1495720658394_0004
17/05/25 14:41:46 INFO JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
17/05/25 14:41:47 INFO Client: Application report for application_1495720658394_0004 (state: RUNNING)
17/05/25 14:41:47 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: 10.185.52.31
ApplicationMaster RPC port: 0
queue: default
start time: 1495723303463
final status: UNDEFINED
tracking URL: http://ip-10-185-53-161.eu-west-1.compute.internal:20888/proxy/application_1495720658394_0004/
user: hadoop
17/05/25 14:41:47 INFO YarnClientSchedulerBackend: Application application_1495720658394_0004 has started running.
17/05/25 14:41:47 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 37860.
17/05/25 14:41:47 INFO NettyBlockTransferService: Server created on 10.185.53.161:37860
17/05/25 14:41:47 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
17/05/25 14:41:47 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.185.53.161, 37860, None)
17/05/25 14:41:47 INFO BlockManagerMasterEndpoint: Registering block manager 10.185.53.161:37860 with 4.0 GB RAM, BlockManagerId(driver, 10.185.53.161, 37860, None)
17/05/25 14:41:47 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.185.53.161, 37860, None)
17/05/25 14:41:47 INFO BlockManager: external shuffle service port = 7337
17/05/25 14:41:47 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.185.53.161, 37860, None)
17/05/25 14:41:47 INFO EventLoggingListener: Logging events to hdfs:///var/log/spark/apps/application_1495720658394_0004
17/05/25 14:41:47 INFO Utils: Using initial executors = 12, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
17/05/25 14:41:50 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(null) (10.185.52.31:57406) with ID 5
17/05/25 14:41:50 INFO ExecutorAllocationManager: New executor 5 has registered (new total is 1)
17/05/25 14:41:50 INFO BlockManagerMasterEndpoint: Registering block manager ip-10-185-52-31.eu-west-1.compute.internal:38781 with 4.0 GB RAM, BlockManagerId(5, ip-10-185-52-31.eu-west-1.compute.internal, 38781, None)
17/05/25 14:41:50 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(null) (10.185.53.45:40096) with ID 3
17/05/25 14:41:50 INFO ExecutorAllocationManager: New executor 3 has registered (new total is 2)
17/05/25 14:41:50 INFO BlockManagerMasterEndpoint: Registering block manager ip-10-185-53-45.eu-west-1.compute.internal:43702 with 4.0 GB RAM, BlockManagerId(3, ip-10-185-53-45.eu-west-1.compute.internal, 43702, None)
17/05/25 14:41:50 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(null) (10.185.53.135:42390) with ID 2
17/05/25 14:41:50 INFO ExecutorAllocationManager: New executor 2 has registered (new total is 3)
17/05/25 14:41:50 INFO BlockManagerMasterEndpoint: Registering block manager ip-10-185-53-135.eu-west-1.compute.internal:41552 with 4.0 GB RAM, BlockManagerId(2, ip-10-185-53-135.eu-west-1.compute.internal, 41552, None)
17/05/25 14:41:50 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(null) (10.185.53.10:60612) with ID 1
17/05/25 14:41:50 INFO ExecutorAllocationManager: New executor 1 has registered (new total is 4)
17/05/25 14:41:50 INFO BlockManagerMasterEndpoint: Registering block manager ip-10-185-53-10.eu-west-1.compute.internal:33391 with 4.0 GB RAM, BlockManagerId(1, ip-10-185-53-10.eu-west-1.compute.internal, 33391, None)
17/05/25 14:41:50 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(null) (10.185.53.68:57424) with ID 4
17/05/25 14:41:50 INFO ExecutorAllocationManager: New executor 4 has registered (new total is 5)
17/05/25 14:41:50 INFO BlockManagerMasterEndpoint: Registering block manager ip-10-185-53-68.eu-west-1.compute.internal:34222 with 4.0 GB RAM, BlockManagerId(4, ip-10-185-53-68.eu-west-1.compute.internal, 34222, None)
17/05/25 14:42:09 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after waiting maxRegisteredResourcesWaitingTime: 30000(ms)
17/05/25 14:42:09 INFO SharedState: Warehouse path is 'hdfs:///user/spark/warehouse'.
17/05/25 14:42:10 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
17/05/25 14:42:11 INFO CodeGenerator: Code generated in 170.416763 ms
17/05/25 14:42:11 INFO SparkContext: Starting job: collect at /home/hadoop/iface_extractions/select_fields.py:90
17/05/25 14:42:11 INFO DAGScheduler: Got job 0 (collect at /home/hadoop/iface_extractions/select_fields.py:90) with 1 output partitions
17/05/25 14:42:11 INFO DAGScheduler: Final stage: ResultStage 0 (collect at /home/hadoop/iface_extractions/select_fields.py:90)
17/05/25 14:42:11 INFO DAGScheduler: Parents of final stage: List()
17/05/25 14:42:11 INFO DAGScheduler: Missing parents: List()
17/05/25 14:42:11 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[2] at collect at /home/hadoop/iface_extractions/select_fields.py:90), which has no missing parents
17/05/25 14:42:11 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 7.5 KB, free 4.0 GB)
17/05/25 14:42:11 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 4.1 KB, free 4.0 GB)
17/05/25 14:42:11 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.185.53.161:37860 (size: 4.1 KB, free: 4.0 GB)
17/05/25 14:42:11 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:996
17/05/25 14:42:11 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[2] at collect at /home/hadoop/iface_extractions/select_fields.py:90)
17/05/25 14:42:11 INFO YarnScheduler: Adding task set 0.0 with 1 tasks
17/05/25 14:42:11 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, ip-10-185-53-135.eu-west-1.compute.internal, executor 2, partition 0, PROCESS_LOCAL, 5899 bytes)
17/05/25 14:42:11 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on ip-10-185-53-135.eu-west-1.compute.internal:41552 (size: 4.1 KB, free: 4.0 GB)
17/05/25 14:42:12 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1101 ms on ip-10-185-53-135.eu-west-1.compute.internal (executor 2) (1/1)
17/05/25 14:42:12 INFO YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool
17/05/25 14:42:12 INFO DAGScheduler: ResultStage 0 (collect at /home/hadoop/iface_extractions/select_fields.py:90) finished in 1.109 s
17/05/25 14:42:12 INFO DAGScheduler: Job 0 finished: collect at /home/hadoop/iface_extractions/select_fields.py:90, took 1.290037 s
17/05/25 14:42:12 INFO BlockManagerInfo: Removed broadcast_0_piece0 on 10.185.53.161:37860 in memory (size: 4.1 KB, free: 4.0 GB)
17/05/25 14:42:12 INFO SparkContext: Starting job: collect at /home/hadoop/iface_extractions/select_fields.py:91
17/05/25 14:42:12 INFO BlockManagerInfo: Removed broadcast_0_piece0 on ip-10-185-53-135.eu-west-1.compute.internal:41552 in memory (size: 4.1 KB, free: 4.0 GB)
17/05/25 14:42:12 INFO DAGScheduler: Got job 1 (collect at /home/hadoop/iface_extractions/select_fields.py:91) with 1 output partitions
17/05/25 14:42:12 INFO DAGScheduler: Final stage: ResultStage 1 (collect at /home/hadoop/iface_extractions/select_fields.py:91)
17/05/25 14:42:12 INFO DAGScheduler: Parents of final stage: List()
17/05/25 14:42:12 INFO DAGScheduler: Missing parents: List()
17/05/25 14:42:12 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[5] at collect at /home/hadoop/iface_extractions/select_fields.py:91), which has no missing parents
17/05/25 14:42:12 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 7.5 KB, free 4.0 GB)
17/05/25 14:42:12 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 4.1 KB, free 4.0 GB)
17/05/25 14:42:12 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 10.185.53.161:37860 (size: 4.1 KB, free: 4.0 GB)
17/05/25 14:42:12 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:996
17/05/25 14:42:12 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (MapPartitionsRDD[5] at collect at /home/hadoop/iface_extractions/select_fields.py:91)
17/05/25 14:42:12 INFO YarnScheduler: Adding task set 1.0 with 1 tasks
17/05/25 14:42:12 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, ip-10-185-53-68.eu-west-1.compute.internal, executor 4, partition 0, PROCESS_LOCAL, 5900 bytes)
17/05/25 14:42:13 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on ip-10-185-53-68.eu-west-1.compute.internal:34222 (size: 4.1 KB, free: 4.0 GB)
17/05/25 14:42:14 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 1047 ms on ip-10-185-53-68.eu-west-1.compute.internal (executor 4) (1/1)
17/05/25 14:42:14 INFO YarnScheduler: Removed TaskSet 1.0, whose tasks have all completed, from pool
17/05/25 14:42:14 INFO DAGScheduler: ResultStage 1 (collect at /home/hadoop/iface_extractions/select_fields.py:91) finished in 1.047 s
17/05/25 14:42:14 INFO DAGScheduler: Job 1 finished: collect at /home/hadoop/iface_extractions/select_fields.py:91, took 1.054768 s
17/05/25 14:42:14 INFO CodeGenerator: Code generated in 13.109425 ms
17/05/25 14:42:14 INFO CodeGenerator: Code generated in 12.568665 ms
17/05/25 14:42:14 INFO CodeGenerator: Code generated in 11.257538 ms
17/05/25 14:42:14 INFO BlockManagerInfo: Removed broadcast_1_piece0 on 10.185.53.161:37860 in memory (size: 4.1 KB, free: 4.0 GB)
17/05/25 14:42:14 INFO BlockManagerInfo: Removed broadcast_1_piece0 on ip-10-185-53-68.eu-west-1.compute.internal:34222 in memory (size: 4.1 KB, free: 4.0 GB)
17/05/25 14:42:14 INFO CodeGenerator: Code generated in 11.563958 ms
17/05/25 14:42:14 INFO CodeGenerator: Code generated in 18.189301 ms
17/05/25 14:42:14 INFO CodeGenerator: Code generated in 13.490762 ms
17/05/25 14:42:14 INFO CodeGenerator: Code generated in 15.156166 ms
17/05/25 14:42:50 INFO ExecutorAllocationManager: Request to remove executorIds: 5, 3
17/05/25 14:42:50 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 5, 3
17/05/25 14:42:50 INFO YarnClientSchedulerBackend: Actual list of executor(s) to be killed is 5, 3
17/05/25 14:42:50 INFO ExecutorAllocationManager: Removing executor 5 because it has been idle for 60 seconds (new desired total will be 4)
17/05/25 14:42:50 INFO ExecutorAllocationManager: Removing executor 3 because it has been idle for 60 seconds (new desired total will be 3)
17/05/25 14:42:50 INFO ExecutorAllocationManager: Request to remove executorIds: 1
17/05/25 14:42:50 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 1
17/05/25 14:42:50 INFO YarnClientSchedulerBackend: Actual list of executor(s) to be killed is 1
17/05/25 14:42:50 INFO ExecutorAllocationManager: Removing executor 1 because it has been idle for 60 seconds (new desired total will be 2)
17/05/25 14:42:50 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 5.
17/05/25 14:42:50 INFO DAGScheduler: Executor lost: 5 (epoch 0)
17/05/25 14:42:50 INFO BlockManagerMasterEndpoint: Trying to remove executor 5 from BlockManagerMaster.
17/05/25 14:42:50 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(5, ip-10-185-52-31.eu-west-1.compute.internal, 38781, None)
17/05/25 14:42:50 INFO BlockManagerMaster: Removed 5 successfully in removeExecutor
17/05/25 14:42:50 INFO YarnScheduler: Executor 5 on ip-10-185-52-31.eu-west-1.compute.internal killed by driver.
17/05/25 14:42:50 INFO ExecutorAllocationManager: Existing executor 5 has been removed (new total is 4)
17/05/25 14:42:51 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 1.
17/05/25 14:42:51 INFO DAGScheduler: Executor lost: 1 (epoch 0)
17/05/25 14:42:51 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
17/05/25 14:42:51 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(1, ip-10-185-53-10.eu-west-1.compute.internal, 33391, None)
17/05/25 14:42:51 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor
17/05/25 14:42:51 INFO YarnScheduler: Executor 1 on ip-10-185-53-10.eu-west-1.compute.internal killed by driver.
17/05/25 14:42:51 INFO ExecutorAllocationManager: Existing executor 1 has been removed (new total is 3)
17/05/25 14:42:51 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 3.
17/05/25 14:42:51 INFO DAGScheduler: Executor lost: 3 (epoch 0)
17/05/25 14:42:51 INFO BlockManagerMasterEndpoint: Trying to remove executor 3 from BlockManagerMaster.
17/05/25 14:42:51 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(3, ip-10-185-53-45.eu-west-1.compute.internal, 43702, None)
17/05/25 14:42:51 INFO BlockManagerMaster: Removed 3 successfully in removeExecutor
17/05/25 14:42:51 INFO YarnScheduler: Executor 3 on ip-10-185-53-45.eu-west-1.compute.internal killed by driver.
17/05/25 14:42:51 INFO ExecutorAllocationManager: Existing executor 3 has been removed (new total is 2)
17/05/25 14:43:12 INFO ExecutorAllocationManager: Request to remove executorIds: 2
17/05/25 14:43:12 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 2
17/05/25 14:43:12 INFO YarnClientSchedulerBackend: Actual list of executor(s) to be killed is 2
17/05/25 14:43:12 INFO ExecutorAllocationManager: Removing executor 2 because it has been idle for 60 seconds (new desired total will be 1)
17/05/25 14:43:13 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 2.
17/05/25 14:43:13 INFO DAGScheduler: Executor lost: 2 (epoch 0)
17/05/25 14:43:13 INFO BlockManagerMasterEndpoint: Trying to remove executor 2 from BlockManagerMaster.
17/05/25 14:43:13 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(2, ip-10-185-53-135.eu-west-1.compute.internal, 41552, None)
17/05/25 14:43:13 INFO BlockManagerMaster: Removed 2 successfully in removeExecutor
17/05/25 14:43:13 INFO YarnScheduler: Executor 2 on ip-10-185-53-135.eu-west-1.compute.internal killed by driver.
17/05/25 14:43:13 INFO ExecutorAllocationManager: Existing executor 2 has been removed (new total is 1)
17/05/25 14:43:14 INFO ExecutorAllocationManager: Request to remove executorIds: 4
17/05/25 14:43:14 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 4
17/05/25 14:43:14 INFO YarnClientSchedulerBackend: Actual list of executor(s) to be killed is 4
17/05/25 14:43:14 INFO ExecutorAllocationManager: Removing executor 4 because it has been idle for 60 seconds (new desired total will be 0)
17/05/25 14:43:17 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 4.
17/05/25 14:43:17 INFO DAGScheduler: Executor lost: 4 (epoch 0)
17/05/25 14:43:17 INFO BlockManagerMasterEndpoint: Trying to remove executor 4 from BlockManagerMaster.
17/05/25 14:43:17 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(4, ip-10-185-53-68.eu-west-1.compute.internal, 34222, None)
17/05/25 14:43:17 INFO BlockManagerMaster: Removed 4 successfully in removeExecutor
17/05/25 14:43:17 INFO YarnScheduler: Executor 4 on ip-10-185-53-68.eu-west-1.compute.internal killed by driver.
17/05/25 14:43:17 INFO ExecutorAllocationManager: Existing executor 4 has been removed (new total is 0)
My guess is that you've got Dynamic Resource Allocation enabled in your Spark configuration.
Spark provides a mechanism to dynamically adjust the resources your application occupies based on the workload. This means that your application may give resources back to the cluster if they are no longer used and request them again later when there is demand. This feature is particularly useful if multiple applications share resources in your Spark cluster.
This feature is disabled by default and available on all coarse-grained cluster managers, i.e. standalone mode, YARN mode, and Mesos coarse-grained mode.
I highlighted the relevant part that says it is disabled by default and hence I can only guess that it was enabled.
From ExecutorAllocationManager:
An agent that dynamically allocates and removes executors based on the workload.
With that said, I'd use the web UI and check whether the spark.dynamicAllocation.enabled property is enabled or not.
There are two requirements for using this feature (Dynamic Resource Allocation). First, your application must set spark.dynamicAllocation.enabled to true. Second, you must set up an external shuffle service on each worker node in the same cluster and set spark.shuffle.service.enabled to true in your application.
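As a hedged sketch of those two requirements expressed programmatically in PySpark (the property names come from the quoted docs; the min/max bounds are illustrative), enabling the feature looks roughly like this. Conversely, setting spark.dynamicAllocation.enabled to false pins the executor count:

from pyspark import SparkConf, SparkContext

# Sketch only: the two requirements quoted above, expressed as SparkConf entries.
conf = (
    SparkConf()
    .set("spark.dynamicAllocation.enabled", "true")    # requirement 1
    .set("spark.shuffle.service.enabled", "true")      # requirement 2 (an external shuffle service must also run on each worker node)
    .set("spark.dynamicAllocation.minExecutors", "1")  # illustrative bounds
    .set("spark.dynamicAllocation.maxExecutors", "10")
)
sc = SparkContext(conf=conf)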
This is the line that prints out the INFO message:
logInfo("Request to remove executorIds: " + executors.mkString(", "))
You can also kill executors using SparkContext.killExecutors, which gives a Spark developer a way to kill executors explicitly.
killExecutors(executorIds: Seq[String]): Boolean Request that the cluster manager kill the specified executors.
There are actually two killExecutors methods, and they are very helpful for demo purposes, as you can easily show how executors come and go.

Spark with GPUs: How to force 1 task per executor

I have Spark 2.1.0 running on a cluster with N slave nodes. Each node has 16 cores (8 cores/cpu and 2 cpus) and 1 GPU. I want to use the map process to launch a GPU kernel. Since there is only 1 GPU per node, I need to ensure that two executors are not on the same node (at the same time) trying to use the GPU and that two tasks are not submitted to the same executor at the same time.
How can I force Spark to have one executor per node?
I have tried the following:
--Setting: spark.executor.cores 16 in $SPARK_HOME/conf/spark-defaults.conf
--Setting: SPARK_WORKER_CORES = 16 and SPARK_WORKER_INSTANCES = 1 in $SPARK_HOME/conf/spark-env.sh
and,
--Setting conf = SparkConf().set('spark.executor.cores', 16).set('spark.executor.instances', 6) directly in my spark script (when I wanted N=6 for debugging purposes).
These options create 6 executors on different nodes as desired, but it seems that each task is assigned to the same executor.
Here are some snippets from my most recent output (which led me to believe it should be working as I want).
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20170217110910-0000/0 on worker-20170217110853-10.128.14.208-35771 (10.128.14.208:35771) with 16 cores
17/02/17 11:09:10 INFO StandaloneSchedulerBackend: Granted executor ID app-20170217110910-0000/0 on hostPort 10.128.14.208:35771 with 16 cores, 16.0 GB RAM
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20170217110910-0000/1 on worker-20170217110853-10.128.9.95-59294 (10.128.9.95:59294) with 16 cores
17/02/17 11:09:10 INFO StandaloneSchedulerBackend: Granted executor ID app-20170217110910-0000/1 on hostPort 10.128.9.95:59294 with 16 cores, 16.0 GB RAM
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20170217110910-0000/2 on worker-20170217110853-10.128.3.71-47507 (10.128.3.71:47507) with 16 cores
17/02/17 11:09:10 INFO StandaloneSchedulerBackend: Granted executor ID app-20170217110910-0000/2 on hostPort 10.128.3.71:47507 with 16 cores, 16.0 GB RAM
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20170217110910-0000/3 on worker-20170217110853-10.128.9.96-50800 (10.128.9.96:50800) with 16 cores
17/02/17 11:09:10 INFO StandaloneSchedulerBackend: Granted executor ID app-20170217110910-0000/3 on hostPort 10.128.9.96:50800 with 16 cores, 16.0 GB RAM
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20170217110910-0000/4 on worker-20170217110853-10.128.3.73-60194 (10.128.3.73:60194) with 16 cores
17/02/17 11:09:10 INFO StandaloneSchedulerBackend: Granted executor ID app-20170217110910-0000/4 on hostPort 10.128.3.73:60194 with 16 cores, 16.0 GB RAM
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20170217110910-0000/5 on worker-20170217110853-10.128.3.74-42793 (10.128.3.74:42793) with 16 cores
17/02/17 11:09:10 INFO StandaloneSchedulerBackend: Granted executor ID app-20170217110910-0000/5 on hostPort 10.128.3.74:42793 with 16 cores, 16.0 GB RAM
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20170217110910-0000/1 is now RUNNING
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20170217110910-0000/3 is now RUNNING
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20170217110910-0000/4 is now RUNNING
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20170217110910-0000/2 is now RUNNING
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20170217110910-0000/0 is now RUNNING
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20170217110910-0000/5 is now RUNNING
17/02/17 11:09:11 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
My RDD has 6 partitions.
The important thing is that 6 Executors were started, each with a different IP address and each getting 16 cores (exactly what I expected). The phrase My RDD has 6 partitions. is a print statement from my code after repartitioning my RDD (to make sure I had 1 partition per executor).
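A minimal, self-contained sketch of that repartition-and-print step (toy data and hypothetical names, purely for illustration):

from pyspark import SparkContext

# Sketch only: toy data standing in for the real RDD, repartitioned to one partition per executor.
sc = SparkContext.getOrCreate()
rdd = sc.parallelize(range(600)).repartition(6)
print("My RDD has {} partitions.".format(rdd.getNumPartitions()))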
Then, THIS happens... each of the 6 tasks is sent to the same executor!
17/02/17 11:09:12 INFO TaskSchedulerImpl: Adding task set 0.0 with 6 tasks
17/02/17 11:09:17 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(null) (10.128.9.95:34059) with ID 1
17/02/17 11:09:17 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 10.128.9.95, executor 1, partition 0, PROCESS_LOCAL, 6095 bytes)
17/02/17 11:09:17 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 10.128.9.95, executor 1, partition 1, PROCESS_LOCAL, 6095 bytes)
17/02/17 11:09:17 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, 10.128.9.95, executor 1, partition 2, PROCESS_LOCAL, 6095 bytes)
17/02/17 11:09:17 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, 10.128.9.95, executor 1, partition 3, PROCESS_LOCAL, 6095 bytes)
17/02/17 11:09:17 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4, 10.128.9.95, executor 1, partition 4, PROCESS_LOCAL, 6095 bytes)
17/02/17 11:09:17 INFO TaskSetManager: Starting task 5.0 in stage 0.0 (TID 5, 10.128.9.95, executor 1, partition 5, PROCESS_LOCAL, 6095 bytes)
Why? And how can I fix it? The problem is that at this point all 6 tasks compete for the same GPU, and the GPU cannot be shared.
I tried the suggestions in the comments from Samson Scharfrichter, but they didn't seem to work. However, I found http://spark.apache.org/docs/latest/configuration.html#scheduling, which includes spark.task.cpus. If I set that to 16 and spark.executor.cores to 16, then I appear to get one task assigned to each executor.
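A hedged sketch of the combination described as working (only spark.task.cpus and spark.executor.cores come from the text above; the executor count is carried over from the earlier attempt):

from pyspark import SparkConf, SparkContext

# Sketch only: with task.cpus equal to executor.cores, a 16-core executor can
# run at most one task at a time, so each node's single GPU sees one task.
conf = (
    SparkConf()
    .set("spark.executor.cores", "16")
    .set("spark.task.cpus", "16")
    .set("spark.executor.instances", "6")  # from the earlier debugging setup; illustrative here
)
sc = SparkContext(conf=conf)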

Spark metrics on wordcount example

I read the Metrics section on the Spark website. I want to try it on the wordcount example, but I can't make it work.
spark/conf/metrics.properties :
# Enable CsvSink for all instances
*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
# Polling period for CsvSink
*.sink.csv.period=1
*.sink.csv.unit=seconds
# Polling directory for CsvSink
*.sink.csv.directory=/home/spark/Documents/test/
# Worker instance overlap polling period
worker.sink.csv.period=1
worker.sink.csv.unit=seconds
# Enable jvm source for instance master, worker, driver and executor
master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
I run my app in local mode as in the documentation:
$SPARK_HOME/bin/spark-submit --class "SimpleApp" --master local[4] target/scala-2.10/simple-project_2.10-1.0.jar
I checked /home/spark/Documents/test/ and it is empty.
What did I miss?
Shell:
$SPARK_HOME/bin/spark-submit --class "SimpleApp" --master local[4] --conf spark.metrics.conf=/home/spark/development/spark/conf/metrics.properties target/scala-2.10/simple-project_2.10-1.0.jar
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
INFO SparkContext: Running Spark version 1.3.0
WARN Utils: Your hostname, cv-local resolves to a loopback address: 127.0.1.1; using 192.168.1.64 instead (on interface eth0)
WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
INFO SecurityManager: Changing view acls to: spark
INFO SecurityManager: Changing modify acls to: spark
INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark); users with modify permissions: Set(spark)
INFO Slf4jLogger: Slf4jLogger started
INFO Remoting: Starting remoting
INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver#cv-local.local:35895]
INFO Utils: Successfully started service 'sparkDriver' on port 35895.
INFO SparkEnv: Registering MapOutputTracker
INFO SparkEnv: Registering BlockManagerMaster
INFO DiskBlockManager: Created local directory at /tmp/spark-447d56c9-cfe5-4f9d-9e0a-6bb476ddede6/blockmgr-4eaa04f4-b4b2-4b05-ba0e-fd1aeb92b289
INFO MemoryStore: MemoryStore started with capacity 265.4 MB
INFO HttpFileServer: HTTP File server directory is /tmp/spark-fae11cd2-937e-4be3-a273-be8b4c4847df/httpd-ca163445-6fff-45e4-9c69-35edcea83b68
INFO HttpServer: Starting HTTP Server
INFO Utils: Successfully started service 'HTTP file server' on port 52828.
INFO SparkEnv: Registering OutputCommitCoordinator
INFO Utils: Successfully started service 'SparkUI' on port 4040.
INFO SparkUI: Started SparkUI at http://cv-local.local:4040
INFO SparkContext: Added JAR file:/home/spark/workspace/IdeaProjects/wordcount/target/scala-2.10/simple-project_2.10-1.0.jar at http://192.168.1.64:52828/jars/simple-project_2.10-1.0.jar with timestamp 1444049152348
INFO Executor: Starting executor ID <driver> on host localhost
INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver#cv-local.local:35895/user/HeartbeatReceiver
INFO NettyBlockTransferService: Server created on 60320
INFO BlockManagerMaster: Trying to register BlockManager
INFO BlockManagerMasterActor: Registering block manager localhost:60320 with 265.4 MB RAM, BlockManagerId(<driver>, localhost, 60320)
INFO BlockManagerMaster: Registered BlockManager
INFO MemoryStore: ensureFreeSpace(34046) called with curMem=0, maxMem=278302556
INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 33.2 KB, free 265.4 MB)
INFO MemoryStore: ensureFreeSpace(5221) called with curMem=34046, maxMem=278302556
INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 5.1 KB, free 265.4 MB)
INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:60320 (size: 5.1 KB, free: 265.4 MB)
INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
INFO SparkContext: Created broadcast 0 from textFile at SimpleApp.scala:11
WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
WARN LoadSnappy: Snappy native library not loaded
INFO FileInputFormat: Total input paths to process : 1
INFO SparkContext: Starting job: count at SimpleApp.scala:12
INFO DAGScheduler: Got job 0 (count at SimpleApp.scala:12) with 2 output partitions (allowLocal=false)
INFO DAGScheduler: Final stage: Stage 0(count at SimpleApp.scala:12)
INFO DAGScheduler: Parents of final stage: List()
INFO DAGScheduler: Missing parents: List()
INFO DAGScheduler: Submitting Stage 0 (MapPartitionsRDD[2] at filter at SimpleApp.scala:12), which has no missing parents
INFO MemoryStore: ensureFreeSpace(2848) called with curMem=39267, maxMem=278302556
INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 2.8 KB, free 265.4 MB)
INFO MemoryStore: ensureFreeSpace(2056) called with curMem=42115, maxMem=278302556
INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.0 KB, free 265.4 MB)
INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:60320 (size: 2.0 KB, free: 265.4 MB)
INFO BlockManagerMaster: Updated info of block broadcast_1_piece0
INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:839
INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (MapPartitionsRDD[2] at filter at SimpleApp.scala:12)
INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1391 bytes)
INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, PROCESS_LOCAL, 1391 bytes)
INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
INFO Executor: Fetching http://192.168.1.64:52828/jars/simple-project_2.10-1.0.jar with timestamp 1444049152348
INFO Utils: Fetching http://192.168.1.64:52828/jars/simple-project_2.10-1.0.jar to /tmp/spark-cab5a940-e2a4-4caf-8549-71e1518271f1/userFiles-c73172c2-7af6-4861-a945-b183edbbafa1/fetchFileTemp4229868141058449157.tmp
INFO Executor: Adding file:/tmp/spark-cab5a940-e2a4-4caf-8549-71e1518271f1/userFiles-c73172c2-7af6-4861-a945-b183edbbafa1/simple-project_2.10-1.0.jar to class loader
INFO CacheManager: Partition rdd_1_1 not found, computing it
INFO CacheManager: Partition rdd_1_0 not found, computing it
INFO HadoopRDD: Input split: file:/home/spark/development/spark/conf/metrics.properties:2659+2659
INFO HadoopRDD: Input split: file:/home/spark/development/spark/conf/metrics.properties:0+2659
INFO MemoryStore: ensureFreeSpace(7840) called with curMem=44171, maxMem=278302556
INFO MemoryStore: Block rdd_1_0 stored as values in memory (estimated size 7.7 KB, free 265.4 MB)
INFO BlockManagerInfo: Added rdd_1_0 in memory on localhost:60320 (size: 7.7 KB, free: 265.4 MB)
INFO BlockManagerMaster: Updated info of block rdd_1_0
INFO MemoryStore: ensureFreeSpace(8648) called with curMem=52011, maxMem=278302556
INFO MemoryStore: Block rdd_1_1 stored as values in memory (estimated size 8.4 KB, free 265.4 MB)
INFO BlockManagerInfo: Added rdd_1_1 in memory on localhost:60320 (size: 8.4 KB, free: 265.4 MB)
INFO BlockManagerMaster: Updated info of block rdd_1_1
INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 2399 bytes result sent to driver
INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2399 bytes result sent to driver
INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 139 ms on localhost (1/2)
INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 133 ms on localhost (2/2)
INFO DAGScheduler: Stage 0 (count at SimpleApp.scala:12) finished in 0.151 s
INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
INFO DAGScheduler: Job 0 finished: count at SimpleApp.scala:12, took 0.225939 s
INFO SparkContext: Starting job: count at SimpleApp.scala:13
INFO DAGScheduler: Got job 1 (count at SimpleApp.scala:13) with 2 output partitions (allowLocal=false)
INFO DAGScheduler: Final stage: Stage 1(count at SimpleApp.scala:13)
INFO DAGScheduler: Parents of final stage: List()
INFO DAGScheduler: Missing parents: List()
INFO DAGScheduler: Submitting Stage 1 (MapPartitionsRDD[3] at filter at SimpleApp.scala:13), which has no missing parents
INFO MemoryStore: ensureFreeSpace(2848) called with curMem=60659, maxMem=278302556
INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.8 KB, free 265.3 MB)
INFO MemoryStore: ensureFreeSpace(2056) called with curMem=63507, maxMem=278302556
INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 2.0 KB, free 265.3 MB)
INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:60320 (size: 2.0 KB, free: 265.4 MB)
INFO BlockManagerMaster: Updated info of block broadcast_2_piece0
INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:839
INFO DAGScheduler: Submitting 2 missing tasks from Stage 1 (MapPartitionsRDD[3] at filter at SimpleApp.scala:13)
INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, localhost, PROCESS_LOCAL, 1391 bytes)
INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, localhost, PROCESS_LOCAL, 1391 bytes)
INFO Executor: Running task 0.0 in stage 1.0 (TID 2)
INFO Executor: Running task 1.0 in stage 1.0 (TID 3)
INFO BlockManager: Found block rdd_1_0 locally
INFO Executor: Finished task 0.0 in stage 1.0 (TID 2). 1830 bytes result sent to driver
INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 9 ms on localhost (1/2)
INFO BlockManager: Found block rdd_1_1 locally
INFO Executor: Finished task 1.0 in stage 1.0 (TID 3). 1830 bytes result sent to driver
INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 10 ms on localhost (2/2)
INFO DAGScheduler: Stage 1 (count at SimpleApp.scala:13) finished in 0.011 s
INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
INFO DAGScheduler: Job 1 finished: count at SimpleApp.scala:13, took 0.024084 s
Lines with a: 5, Lines with b: 12
I made it work by specifying the path to the metrics file in the spark-submit command:
--files=/yourPath/metrics.properties --conf spark.metrics.conf=./metrics.properties
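An equivalent programmatic sketch, assuming the same file path (spark.files plays the role of --files here):

from pyspark import SparkConf, SparkContext

# Sketch only: ship the metrics file with the job and point the metrics system
# at the copy in the working directory, mirroring the --files / --conf pair above.
conf = (
    SparkConf()
    .set("spark.files", "/yourPath/metrics.properties")
    .set("spark.metrics.conf", "./metrics.properties")
)
sc = SparkContext(conf=conf)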
