Spark removing executors in DataProc clusters - apache-spark

During a Spark job execution I'm constantly facing this following issue. Even the DataProc cluster was idle Spark driver is killing all the executors and I end up with one executor.
18/02/26 08:47:05 INFO spark.ExecutorAllocationManager: Existing executor 35 has been removed (new total is 2)
18/02/26 08:50:40 INFO scheduler.TaskSetManager: Finished task 189.0 in stage 5.0 (TID 6002) in 569184 ms on dse-dev-dataproc-w-53.c.k-ddh-lle.internal (executor 57) (499/500)
18/02/26 08:51:40 INFO spark.ExecutorAllocationManager: Request to remove executorIds: 57
18/02/26 08:51:40 INFO cluster.YarnClientSchedulerBackend: Requesting to kill executor(s) 57
18/02/26 08:51:40 INFO cluster.YarnClientSchedulerBackend: Actual list of executor(s) to be killed is 57
18/02/26 08:51:40 INFO spark.ExecutorAllocationManager: Removing executor 57 because it has been idle for 60 seconds (new desired total will be 1)
18/02/26 08:51:42 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 57.
18/02/26 08:51:42 INFO scheduler.DAGScheduler: Executor lost: 57 (epoch 7)
18/02/26 08:51:42 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 57 from BlockManagerMaster.
18/02/26 08:51:42 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(57, dse-dev-dataproc-w-53.c.k-ddh-lle.internal, 53072, None)
18/02/26 08:51:42 INFO storage.BlockManagerMaster: Removed 57 successfully in removeExecutor
18/02/26 08:51:42 INFO cluster.YarnScheduler: Executor 57 on dse-dev-dataproc-w-53.c.k-ddh-lle.internal killed by driver.
18/02/26 08:51:42 INFO spark.ExecutorAllocationManager: Existing executor 57 has been removed (new total is 1)
Once the above task is completed when it starts the next Stage it looks for the shuffled data.
18/02/26 08:52:33 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 5 to 10.206.52.190:42676
18/02/26 08:52:33 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 7 to 10.206.52.157:45812
18/02/26 08:52:33 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 7 to 10.206.52.177:53612
18/02/26 08:52:33 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 5 to 10.206.52.166:41901
This window will hang for sometime and the tasks fail with the following exception.
18/02/26 09:12:33 INFO BlockManagerMaster: Removal of executor 21 requested
18/02/26 09:12:33 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asked to remove non-existent executor 21
18/02/26 00:12:33 INFO BlockManagerMasterEndpoint: Trying to remove executor 21 from BlockManagerMaster.
18/02/26 09:12:33 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1480517110174_0001_01_000049 on host: ip-10-138-114-125.ec2.internal. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_1480517110174_0001_01_000049
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
My Job executor settings:
spark.driver.maxResultSize 3840m
spark.driver.memory 16G
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.maxExecutors 100
spark.dynamicAllocation.minExecutors 1
spark.executor.cores 6
spark.executor.id driver
spark.executor.instances 20
spark.executor.memory 30G
spark.hadoop.yarn.timeline-service.enabled False
spark.shuffle.service.enabled true
spark.sql.catalogImplementation hive
spark.sql.parquet.cacheMetadata false

Related

can driver program and cluster manager(resource manager) can available on same machine in spark stand alone

I'm doing spark submit from the same machine as spark master, using following command ./bin/spark-submit --master spark://ip:port --deploy-mode "client" test.py I'm my application running forever with following kind of output
22/11/18 13:17:37 INFO BlockManagerMaster: Removal of executor 8 requested
22/11/18 13:17:37 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 8
22/11/18 13:17:37 INFO StandaloneSchedulerBackend: Granted executor ID app-20221118131723-0008/10 on hostPort 192.168.210.94:37443 with 2 core(s), 1024.0 MiB RAM
22/11/18 13:17:37 INFO BlockManagerMasterEndpoint: Trying to remove executor 8 from BlockManagerMaster.
22/11/18 13:17:37 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20221118131723-0008/10 is now RUNNING
22/11/18 13:17:38 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20221118131723-0008/9 is now EXITED (Command exited with code 1)
22/11/18 13:17:38 INFO StandaloneSchedulerBackend: Executor app-20221118131723-0008/9 removed: Command exited with code 1
22/11/18 13:17:38 INFO BlockManagerMaster: Removal of executor 9 requested
22/11/18 13:17:38 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 9
22/11/18 13:17:38 INFO BlockManagerMasterEndpoint: Trying to remove executor 9 from BlockManagerMaster.
22/11/18 13:17:38 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20221118131723-0008/11 on worker-20221118111836-192.168.210.82-46395 (192.168.210.82:4639
But when I run from other nodes, my application is running successfully what could be reason?

HDP3.1.4 - Spark2 with Hive Warehouse Connector error using spark-submit and pyspark shell: KeeperErrorCode = ConnectionLoss

Environment:
HDP 3.1.4 - configured and tested Hive server 2 - tested and working
Hive server 2 LLAP - tested and working Spark configured as per documentation to use Hive Warehouse Connector (HWC)
Apache Zeppelin - spark2 interpreter configured to use HWC
Trying to execute the following script:
from pyspark.sql import SparkSession
from pyspark_llap import HiveWarehouseSession
# Create spark session
spark = SparkSession.builder.appName("LLAP Test - CLI").enableHiveSupport().getOrCreate()
# Create HWC session
hive = HiveWarehouseSession.session(spark).userPassword('hive','hive').build()
# Execute a query to read from Spark using HWC
hive.executeQuery("select * from wifi_table where partit='2019-12-02'").show(20)
Problem:
When submitting an application with spark-submit or using the pyspark shell with the above script (or any script that executes a query to with the HiveWarehouseSession) and the spark job gets stuck, throwing an exception: java.lang.RuntimeException: java.io.IOException: shadecurator.org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
The command to execute is the following:
$ /usr/hdp/current/spark2-client/bin/spark-submit --master yarn --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.4.0-315.jar --py-files /usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.1.4.0-315.zip spark_compare_test.py
Here is the stacktrace:
[...]
20/01/03 12:39:55 INFO SparkContext: Starting job: showString at NativeMethodAccessorImpl.java:0
20/01/03 12:39:56 INFO DAGScheduler: Got job 0 (showString at NativeMethodAccessorImpl.java:0) with 1 output partitions
20/01/03 12:39:56 INFO DAGScheduler: Final stage: ResultStage 0 (showString at NativeMethodAccessorImpl.java:0)
20/01/03 12:39:56 INFO DAGScheduler: Parents of final stage: List()
20/01/03 12:39:56 INFO DAGScheduler: Missing parents: List()
20/01/03 12:39:56 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[3] at showString at NativeMethodAccessorImpl.java:0), which has no missing parents
20/01/03 12:39:56 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 9.5 KB, free 366.3 MB)
20/01/03 12:39:56 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 3.6 KB, free 366.3 MB)
20/01/03 12:39:56 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on EDGE01.machine:38050 (size: 3.6 KB, free: 366.3 MB)
20/01/03 12:39:56 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1039
20/01/03 12:39:56 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[3] at showString at NativeMethodAccessorImpl.java:0) (first 15 tasks are for partitions Vector(0))
20/01/03 12:39:56 INFO YarnScheduler: Adding task set 0.0 with 1 tasks
20/01/03 12:39:56 WARN TaskSetManager: Stage 0 contains a task of very large size (465 KB). The maximum recommended task size is 100 KB.
20/01/03 12:39:56 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, DN02.machine, executor 2, partition 0, NODE_LOCAL, 476705 bytes)
20/01/03 12:39:56 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on DN02.machine:41521 (size: 3.6 KB, free: 366.3 MB)
20/01/03 12:42:08 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, DN02.machine, executor 2): java.lang.RuntimeException: java.io.IOException: shadecurator.org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
at com.hortonworks.spark.sql.hive.llap.HiveWarehouseDataReaderFactory.createDataReader(HiveWarehouseDataReaderFactory.java:66)
at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD.compute(DataSourceRDD.scala:42)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: shadecurator.org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
at org.apache.hadoop.hive.registry.impl.ZkRegistryBase.ensureInstancesCache(ZkRegistryBase.java:619)
at org.apache.hadoop.hive.llap.registry.impl.LlapZookeeperRegistryImpl.getInstances(LlapZookeeperRegistryImpl.java:422)
at org.apache.hadoop.hive.llap.registry.impl.LlapZookeeperRegistryImpl.getInstances(LlapZookeeperRegistryImpl.java:63)
at org.apache.hadoop.hive.llap.registry.impl.LlapRegistryService.getInstances(LlapRegistryService.java:181)
at org.apache.hadoop.hive.llap.registry.impl.LlapRegistryService.getInstances(LlapRegistryService.java:177)
at org.apache.hadoop.hive.llap.LlapBaseInputFormat.getServiceInstanceForHost(LlapBaseInputFormat.java:415)
at org.apache.hadoop.hive.llap.LlapBaseInputFormat.getServiceInstance(LlapBaseInputFormat.java:397)
at org.apache.hadoop.hive.llap.LlapBaseInputFormat.getRecordReader(LlapBaseInputFormat.java:160)
at com.hortonworks.spark.sql.hive.llap.HiveWarehouseDataReader.getRecordReader(HiveWarehouseDataReader.java:72)
at com.hortonworks.spark.sql.hive.llap.HiveWarehouseDataReader.<init>(HiveWarehouseDataReader.java:50)
at com.hortonworks.spark.sql.hive.llap.HiveWarehouseDataReaderFactory.getDataReader(HiveWarehouseDataReaderFactory.java:72)
at com.hortonworks.spark.sql.hive.llap.HiveWarehouseDataReaderFactory.createDataReader(HiveWarehouseDataReaderFactory.java:64)
... 18 more
Caused by: shadecurator.org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
at shadecurator.org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:225)
at shadecurator.org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:94)
at shadecurator.org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:117)
at shadecurator.org.apache.curator.framework.imps.CuratorFrameworkImpl.getZooKeeper(CuratorFrameworkImpl.java:489)
at shadecurator.org.apache.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:199)
at shadecurator.org.apache.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:193)
at shadecurator.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:109)
at shadecurator.org.apache.curator.framework.imps.ExistsBuilderImpl.pathInForeground(ExistsBuilderImpl.java:190)
at shadecurator.org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:175)
at shadecurator.org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:32)
at shadecurator.org.apache.curator.framework.imps.CuratorFrameworkImpl.createContainers(CuratorFrameworkImpl.java:194)
at shadecurator.org.apache.curator.framework.EnsureContainers.internalEnsure(EnsureContainers.java:61)
at shadecurator.org.apache.curator.framework.EnsureContainers.ensure(EnsureContainers.java:53)
at shadecurator.org.apache.curator.framework.recipes.cache.PathChildrenCache.ensurePath(PathChildrenCache.java:576)
at shadecurator.org.apache.curator.framework.recipes.cache.PathChildrenCache.rebuild(PathChildrenCache.java:326)
at shadecurator.org.apache.curator.framework.recipes.cache.PathChildrenCache.start(PathChildrenCache.java:303)
at org.apache.hadoop.hive.registry.impl.ZkRegistryBase.ensureInstancesCache(ZkRegistryBase.java:597)
... 29 more
[...]
I have tried the following with no effect whatsoever:
Checked zookeeper health and connection limiting
Changed zookeeper hosts
Increased zookeeper timeout to 10s, 120s and 600s and no effect
Tried to submit the application on multiple nodes, the error persists
There is another strange behavior, running the script on the Zeppelin spark2 interpreter there is no error and the HWC works. I have compared the environments, and there is no configuration mismatch on the main variables.
At this point I'm stuck and don't know where to look for further troubleshooting. I can add more information as requested.

Optimization of spark cluster configuration( Low configuration virtual machine cluster)

When I use Spark's standalone mode to process a large number of datasets,the log said:
ERROR TaskSchedulerImpl:70 - Lost executor 1 on : Executor heartbeat timed out after 381181 ms
I search the internet, they say I should set parameters with spark submit:
[hadoop#Master spark2.4.0]$ bin/spark-submit --master spark://master:7077 --conf spark.worker.timeout 10000000 --py-files id.py id.py --name id
Error message in log:
Error: Invalid argument to --conf: spark.worker.timeout
Questions:
How to set timeout parameter?
Thanks to meniluca's answer, I lost the symbols in instructions
After adjusting the timeout, the log displays
2019-12-05 19:42:27 WARN Utils:87 - Suppressing exception in finally: broken pipe (Write failed)
java.net.SocketException: broken pipe (Write failed)
2019-12-05 21:13:09 INFO SparkContext:54 - Invoking stop() from shutdown hook
Exception in thread "serve-DataFrame" java.net.SocketException: Connection reset
Suppressed: java.net.SocketException: broken pipe (Write failed)
then,I change thessh,add ServerAliveInterval 60 while ~/.ssh/ config
ServerAliveInterval 60
the error stil exits, then I try to increase the driver memory, error still exists, and show that the connection is disconnected
[hadoop#Master spark2.4.0]$ bin/spark-submit --master spark://master:7077 --conf spark.worker.timeout=10000000 --driver-memory 1g --py-files id.py id.py --name id
2019-12-06 10:38:49 INFO ContextCleaner:54 - Cleaned accumulator 374
Exception in thread "serve-DataFrame" java.net.SocketException: broken pipe (Write failed)
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
at org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:212)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:224)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:224)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:148)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:224)
at org.apache.spark.api.python.PythonRDD$$anonfun$serveIterator$1.apply(PythonRDD.scala:413)
at org.apache.spark.api.python.PythonRDD$$anonfun$serveIterator$1.apply(PythonRDD.scala:412)
at org.apache.spark.api.python.PythonRDD$$anonfun$6$$anonfun$apply$1.apply$mcV$sp(PythonRDD.scala:435)
at org.apache.spark.api.python.PythonRDD$$anonfun$6$$anonfun$apply$1.apply(PythonRDD.scala:435)
at org.apache.spark.api.python.PythonRDD$$anonfun$6$$anonfun$apply$1.apply(PythonRDD.scala:435)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.api.python.PythonRDD$$anonfun$6.apply(PythonRDD.scala:436)
at org.apache.spark.api.python.PythonRDD$$anonfun$6.apply(PythonRDD.scala:432)
at org.apache.spark.api.python.PythonServer$$anon$1.run(PythonRDD.scala:862)
2019-12-06 11:06:12 WARN HeartbeatReceiver:66 - Removing executor 1 with no recent heartbeats: 149103 ms exceeds timeout 120000 ms
2019-12-06 11:06:12 ERROR TaskSchedulerImpl:70 - Lost executor 1 on 219.226.109.129: Executor heartbeat timed out after 149103 ms
2019-12-06 11:06:13 INFO SparkContext:54 - Invoking stop() from shutdown hook
2019-12-06 11:06:13 INFO DAGScheduler:54 - Executor lost: 1 (epoch 6)
2019-12-06 11:06:13 WARN HeartbeatReceiver:66 - Removing executor 0 with no recent heartbeats: 155761 ms exceeds timeout 120000 ms
2019-12-06 11:06:13 ERROR TaskSchedulerImpl:70 - Lost executor 0 on 219.226.109.131: Executor heartbeat timed out after 155761 ms
2019-12-06 11:06:13 INFO StandaloneSchedulerBackend:54 - Requesting to kill executor(s) 1
2019-12-06 11:06:13 INFO BlockManagerMasterEndpoint:54 - Trying to remove executor 1 from BlockManagerMaster.
2019-12-06 11:06:13 INFO BlockManagerMasterEndpoint:54 - Removing block manager BlockManagerId(1, 219.226.109.129, 42501, None)
2019-12-06 11:06:13 INFO BlockManagerMaster:54 - Removed 1 successfully in removeExecutor
2019-12-06 11:06:13 INFO DAGScheduler:54 - Shuffle files lost for executor: 1 (epoch 6)
2019-12-06 11:06:13 INFO StandaloneSchedulerBackend:54 - Actual list of executor(s) to be killed is 1
2019-12-06 11:06:13 INFO DAGScheduler:54 - Host added was in lost list earlier: 219.226.109.129
2019-12-06 11:06:13 INFO DAGScheduler:54 - Executor lost: 0 (epoch 7)
2019-12-06 11:06:13 INFO AbstractConnector:318 - Stopped Spark#490228e{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2019-12-06 11:06:13 INFO BlockManagerMasterEndpoint:54 - Trying to remove executor 0 from BlockManagerMaster.
2019-12-06 11:06:13 INFO BlockManagerMasterEndpoint:54 - Removing block manager BlockManagerId(0, 219.226.109.131, 42164, None)
2019-12-06 11:06:13 INFO BlockManagerMaster:54 - Removed 0 successfully in removeExecutor
2019-12-06 11:06:13 INFO DAGScheduler:54 - Shuffle files lost for executor: 0 (epoch 7)
2019-12-06 11:06:13 INFO DAGScheduler:54 - Host added was in lost list earlier: 219.226.109.131
2019-12-06 11:06:13 INFO SparkUI:54 - Stopped Spark web UI at http://Master:4040
2019-12-06 11:06:13 INFO BlockManagerMasterEndpoint:54 - Registering block manager 219.226.109.129:42501 with 413.9 MB RAM, BlockManagerId(1, 219.226.109.129, 42501, None)
2019-12-06 11:06:13 INFO BlockManagerMasterEndpoint:54 - Registering block manager 219.226.109.131:42164 with 413.9 MB RAM, BlockManagerId(0, 219.226.109.131, 42164, None)
2019-12-06 11:06:14 INFO StandaloneSchedulerBackend:54 - Shutting down all executors
2019-12-06 11:06:14 INFO CoarseGrainedSchedulerBackend$DriverEndpoint:54 - Asking each executor to shut down
2019-12-06 11:06:14 INFO BlockManagerInfo:54 - Added broadcast_15_piece0 in memory on 219.226.109.129:42501 (size: 21.1 KB, free: 413.9 MB)
2019-12-06 11:06:15 INFO MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
2019-12-06 11:06:15 INFO BlockManagerInfo:54 - Added broadcast_15_piece0 in memory on 219.226.109.131:42164 (size: 21.1 KB, free: 413.9 MB)
2019-12-06 11:06:16 INFO MemoryStore:54 - MemoryStore cleared
2019-12-06 11:06:16 INFO BlockManager:54 - BlockManager stopped
2019-12-06 11:06:16 INFO BlockManagerMaster:54 - BlockManagerMaster stopped
2019-12-06 11:06:17 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
2019-12-06 11:06:17 ERROR TransportResponseHandler:144 - Still have 1 requests outstanding when connection from Master/219.226.109.130:7077 is closed
2019-12-06 11:06:17 INFO SparkContext:54 - Successfully stopped SparkContext
2019-12-06 11:06:17 INFO ShutdownHookManager:54 - Shutdown hook called
2019-12-06 11:06:17 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-e2a29bac-7277-4476-ad23-315a27e9ccf0
2019-12-06 11:06:17 INFO ShutdownHookManager:54 - Deleting directory /tmp/localPyFiles-dd95954c-2e77-41ca-969d-a201269f5b5b
2019-12-06 11:06:17 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-bcd56b4a-fb32-4b58-a1d5-71abc5218d32
2019-12-06 11:06:17 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-e2a29bac-7277-4476-ad23-315a27e9ccf0/pyspark-d04b799f-a116-44d5-b6a5-811cc8c03743
Question
Is SSH related to broken pipe?
Is increasing driver memory helpful to this problem?
I see the configuration posts on the Internet, but they're are highly configured. Since I use my computer to built clusters on virtual machine,
, the master has two cores , the slave has one core. How to adjust the configuration ?
Please try with
--conf spark.worker.timeout=10000000
you are missing the equal character between the configuration name and value.
java.net.SocketException: broken pipe (Write failed) occurs when something is wrong with the access port.
I suggest you to change the master which is at port 8080. The port can be changed either in the configuration file or via command-line options.
sbin/start-master.sh
Same can be tried with worker node as well if the above does not fix issue.
To see which ports are being used you can use :
sudo netstat -ltup

Spark executor lost when increasing the number of executor instances

My Hadoop cluster currently has 4 nodes and 45 cores running pyspark 2.4 through YARN. When I run spark-submit with one executor everything works fine, but if I change the number of executor-instances to 3 or 4 the executor is killed by the driver and only one task is working.
I have changed the below settings on Cloudera manager:
yarn.nodemanager.resource.memory-mb : 64 GB
yarn.nodemanager.resource.cpu-vcores:45
And below is the log that I get:
19/03/21 11:28:48 INFO cluster.YarnScheduler: Adding task set 0.0 with 1 tasks
19/03/21 11:28:48 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, datanode1, executor 2, partition 0, PROCESS_LOCAL, 7701 bytes)
19/03/21 11:28:48 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on datanode1:42432 (size: 71.0 KB, free: 366.2 MB)
19/03/21 11:29:43 INFO spark.ExecutorAllocationManager: Request to remove executorIds: 1, 3
19/03/21 11:29:43 INFO cluster.YarnClientSchedulerBackend: Requesting to kill executor(s) 1, 3
19/03/21 11:29:43 INFO cluster.YarnClientSchedulerBackend: Actual list of executor(s) to be killed is 1, 3
19/03/21 11:29:43 INFO spark.ExecutorAllocationManager: Removing executor 1 because it has been idle for 60 seconds (new desired total will be 2)
19/03/21 11:29:43 INFO spark.ExecutorAllocationManager: Removing executor 3 because it has been idle for 60 seconds (new desired total will be 1)
19/03/21 11:29:45 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 3.
19/03/21 11:29:45 INFO scheduler.DAGScheduler: Executor lost: 3 (epoch 0)
19/03/21 11:29:45 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 3 from BlockManagerMaster.
19/03/21 11:29:45 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(3, datanode2, 32853, None)
19/03/21 11:29:45 INFO storage.BlockManagerMaster: Removed 3 successfully in removeExecutor
19/03/21 11:29:45 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 1.
19/03/21 11:29:45 INFO scheduler.DAGScheduler: Executor lost: 1 (epoch 0)
19/03/21 11:29:45 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
19/03/21 11:29:45 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(1, datanode3, 39466, None)
19/03/21 11:29:45 INFO storage.BlockManagerMaster: Removed 1 successfully in removeExecutor
19/03/21 11:29:45 INFO cluster.YarnScheduler: Executor 3 on datanode2 killed by driver.
19/03/21 11:29:45 INFO cluster.YarnScheduler: Executor 1 on datanode3 killed by driver.
19/03/21 11:29:45 INFO spark.ExecutorAllocationManager: Existing executor 3 has been removed (new total is 2)
19/03/21 11:29:45 INFO spark.ExecutorAllocationManager: Existing executor 1 has been removed (new total is 1)

Why would Spark executors be removed (with "ExecutorAllocationManager: Request to remove executorIds" in the logs)?

Im trying to execute a spark job in an AWS cluster of 6 c4.2xlarge nodes and I don't know why Spark is killing the executors...
Any help will be appreciated
Here the spark submit command:
. /usr/bin/spark-submit --packages="com.databricks:spark-avro_2.11:3.2.0" --jars RedshiftJDBC42-1.2.1.1001.jar --deploy-mode client --master yarn --num-executors 12 --executor-cores 3 --executor-memory 7G --driver-memory 7g --py-files dependencies.zip iface_extractions.py 2016-10-01 > output.log
At line this line starts to remove executors
17/05/25 14:42:50 INFO ExecutorAllocationManager: Request to remove executorIds: 5, 3
Output spark-submit log:
Ivy Default Cache set to: /home/hadoop/.ivy2/cache
The jars for the packages stored in: /home/hadoop/.ivy2/jars
:: loading settings :: url = jar:file:/usr/lib/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-avro_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found com.databricks#spark-avro_2.11;3.2.0 in central
found org.slf4j#slf4j-api;1.7.5 in central
found org.apache.avro#avro;1.7.6 in central
found org.codehaus.jackson#jackson-core-asl;1.9.13 in central
found org.codehaus.jackson#jackson-mapper-asl;1.9.13 in central
found com.thoughtworks.paranamer#paranamer;2.3 in central
found org.xerial.snappy#snappy-java;1.0.5 in central
found org.apache.commons#commons-compress;1.4.1 in central
found org.tukaani#xz;1.0 in central
:: resolution report :: resolve 284ms :: artifacts dl 8ms
:: modules in use:
com.databricks#spark-avro_2.11;3.2.0 from central in [default]
com.thoughtworks.paranamer#paranamer;2.3 from central in [default]
org.apache.avro#avro;1.7.6 from central in [default]
org.apache.commons#commons-compress;1.4.1 from central in [default]
org.codehaus.jackson#jackson-core-asl;1.9.13 from central in [default]
org.codehaus.jackson#jackson-mapper-asl;1.9.13 from central in [default]
org.slf4j#slf4j-api;1.7.5 from central in [default]
org.tukaani#xz;1.0 from central in [default]
org.xerial.snappy#snappy-java;1.0.5 from central in [default]
:: evicted modules:
org.slf4j#slf4j-api;1.6.4 by [org.slf4j#slf4j-api;1.7.5] in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 10 | 0 | 0 | 1 || 9 | 0 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
confs: [default]
0 artifacts copied, 9 already retrieved (0kB/8ms)
17/05/25 14:41:37 INFO SparkContext: Running Spark version 2.1.0
17/05/25 14:41:38 INFO SecurityManager: Changing view acls to: hadoop
17/05/25 14:41:38 INFO SecurityManager: Changing modify acls to: hadoop
17/05/25 14:41:38 INFO SecurityManager: Changing view acls groups to:
17/05/25 14:41:38 INFO SecurityManager: Changing modify acls groups to:
17/05/25 14:41:38 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); groups with view permissions: Set(); users with modify permissions: Set(hadoop); groups with modify permissions: Set()
17/05/25 14:41:38 INFO Utils: Successfully started service 'sparkDriver' on port 37132.
17/05/25 14:41:38 INFO SparkEnv: Registering MapOutputTracker
17/05/25 14:41:38 INFO SparkEnv: Registering BlockManagerMaster
17/05/25 14:41:38 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/05/25 14:41:38 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/05/25 14:41:38 INFO DiskBlockManager: Created local directory at /mnt/tmp/blockmgr-e368a261-c1a1-49e7-8533-8081896a45e4
17/05/25 14:41:38 INFO MemoryStore: MemoryStore started with capacity 4.0 GB
17/05/25 14:41:38 INFO SparkEnv: Registering OutputCommitCoordinator
17/05/25 14:41:39 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/05/25 14:41:39 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.185.53.161:4040
17/05/25 14:41:39 INFO Utils: Using initial executors = 12, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
17/05/25 14:41:39 INFO RMProxy: Connecting to ResourceManager at ip-10-185-53-161.eu-west-1.compute.internal/10.185.53.161:8032
17/05/25 14:41:39 INFO Client: Requesting a new application from cluster with 5 NodeManagers
17/05/25 14:41:40 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (11520 MB per container)
17/05/25 14:41:40 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
17/05/25 14:41:40 INFO Client: Setting up container launch context for our AM
17/05/25 14:41:40 INFO Client: Setting up the launch environment for our AM container
17/05/25 14:41:40 INFO Client: Preparing resources for our AM container
17/05/25 14:41:40 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
17/05/25 14:41:42 INFO Client: Uploading resource file:/mnt/tmp/spark-4f534fa1-c377-4113-9c86-96d5cdab4cb5/__spark_libs__6500399427935716229.zip -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/__spark_libs__6500399427935716229.zip
17/05/25 14:41:43 INFO Client: Uploading resource file:/home/hadoop/RedshiftJDBC42-1.2.1.1001.jar -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/RedshiftJDBC42-1.2.1.1001.jar
17/05/25 14:41:43 INFO Client: Uploading resource file:/home/hadoop/.ivy2/jars/com.databricks_spark-avro_2.11-3.2.0.jar -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/com.databricks_spark-avro_2.11-3.2.0.jar
17/05/25 14:41:43 INFO Client: Uploading resource file:/home/hadoop/.ivy2/jars/org.slf4j_slf4j-api-1.7.5.jar -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/org.slf4j_slf4j-api-1.7.5.jar
17/05/25 14:41:43 INFO Client: Uploading resource file:/home/hadoop/.ivy2/jars/org.apache.avro_avro-1.7.6.jar -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/org.apache.avro_avro-1.7.6.jar
17/05/25 14:41:43 INFO Client: Uploading resource file:/home/hadoop/.ivy2/jars/org.codehaus.jackson_jackson-core-asl-1.9.13.jar -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/org.codehaus.jackson_jackson-core-asl-1.9.13.jar
17/05/25 14:41:43 INFO Client: Uploading resource file:/home/hadoop/.ivy2/jars/org.codehaus.jackson_jackson-mapper-asl-1.9.13.jar -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/org.codehaus.jackson_jackson-mapper-asl-1.9.13.jar
17/05/25 14:41:43 INFO Client: Uploading resource file:/home/hadoop/.ivy2/jars/com.thoughtworks.paranamer_paranamer-2.3.jar -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/com.thoughtworks.paranamer_paranamer-2.3.jar
17/05/25 14:41:43 INFO Client: Uploading resource file:/home/hadoop/.ivy2/jars/org.xerial.snappy_snappy-java-1.0.5.jar -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/org.xerial.snappy_snappy-java-1.0.5.jar
17/05/25 14:41:43 INFO Client: Uploading resource file:/home/hadoop/.ivy2/jars/org.apache.commons_commons-compress-1.4.1.jar -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/org.apache.commons_commons-compress-1.4.1.jar
17/05/25 14:41:43 INFO Client: Uploading resource file:/home/hadoop/.ivy2/jars/org.tukaani_xz-1.0.jar -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/org.tukaani_xz-1.0.jar
17/05/25 14:41:43 INFO Client: Uploading resource file:/etc/spark/conf/hive-site.xml -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/hive-site.xml
17/05/25 14:41:43 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/pyspark.zip
17/05/25 14:41:43 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/py4j-0.10.4-src.zip -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/py4j-0.10.4-src.zip
17/05/25 14:41:43 INFO Client: Uploading resource file:/home/hadoop/dependencies.zip -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/dependencies.zip
17/05/25 14:41:43 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/com.databricks_spark-avro_2.11-3.2.0.jar added multiple times to distributed cache.
17/05/25 14:41:43 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.slf4j_slf4j-api-1.7.5.jar added multiple times to distributed cache.
17/05/25 14:41:43 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.apache.avro_avro-1.7.6.jar added multiple times to distributed cache.
17/05/25 14:41:43 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.codehaus.jackson_jackson-core-asl-1.9.13.jar added multiple times to distributed cache.
17/05/25 14:41:43 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.codehaus.jackson_jackson-mapper-asl-1.9.13.jar added multiple times to distributed cache.
17/05/25 14:41:43 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/com.thoughtworks.paranamer_paranamer-2.3.jar added multiple times to distributed cache.
17/05/25 14:41:43 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.xerial.snappy_snappy-java-1.0.5.jar added multiple times to distributed cache.
17/05/25 14:41:43 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.apache.commons_commons-compress-1.4.1.jar added multiple times to distributed cache.
17/05/25 14:41:43 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.tukaani_xz-1.0.jar added multiple times to distributed cache.
17/05/25 14:41:43 INFO Client: Uploading resource file:/mnt/tmp/spark-4f534fa1-c377-4113-9c86-96d5cdab4cb5/__spark_conf__1516567354161750682.zip -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/__spark_conf__.zip
17/05/25 14:41:43 INFO SecurityManager: Changing view acls to: hadoop
17/05/25 14:41:43 INFO SecurityManager: Changing modify acls to: hadoop
17/05/25 14:41:43 INFO SecurityManager: Changing view acls groups to:
17/05/25 14:41:43 INFO SecurityManager: Changing modify acls groups to:
17/05/25 14:41:43 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); groups with view permissions: Set(); users with modify permissions: Set(hadoop); groups with modify permissions: Set()
17/05/25 14:41:43 INFO Client: Submitting application application_1495720658394_0004 to ResourceManager
17/05/25 14:41:43 INFO YarnClientImpl: Submitted application application_1495720658394_0004
17/05/25 14:41:43 INFO SchedulerExtensionServices: Starting Yarn extension services with app application_1495720658394_0004 and attemptId None
17/05/25 14:41:44 INFO Client: Application report for application_1495720658394_0004 (state: ACCEPTED)
17/05/25 14:41:44 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1495723303463
final status: UNDEFINED
tracking URL: http://ip-10-185-53-161.eu-west-1.compute.internal:20888/proxy/application_1495720658394_0004/
user: hadoop
17/05/25 14:41:45 INFO Client: Application report for application_1495720658394_0004 (state: ACCEPTED)
17/05/25 14:41:46 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as NettyRpcEndpointRef(null)
17/05/25 14:41:46 INFO Client: Application report for application_1495720658394_0004 (state: ACCEPTED)
17/05/25 14:41:46 INFO YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> ip-10-185-53-161.eu-west-1.compute.internal, PROXY_URI_BASES -> http://ip-10-185-53-161.eu-west-1.compute.internal:20888/proxy/application_1495720658394_0004), /proxy/application_1495720658394_0004
17/05/25 14:41:46 INFO JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
17/05/25 14:41:47 INFO Client: Application report for application_1495720658394_0004 (state: RUNNING)
17/05/25 14:41:47 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: 10.185.52.31
ApplicationMaster RPC port: 0
queue: default
start time: 1495723303463
final status: UNDEFINED
tracking URL: http://ip-10-185-53-161.eu-west-1.compute.internal:20888/proxy/application_1495720658394_0004/
user: hadoop
17/05/25 14:41:47 INFO YarnClientSchedulerBackend: Application application_1495720658394_0004 has started running.
17/05/25 14:41:47 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 37860.
17/05/25 14:41:47 INFO NettyBlockTransferService: Server created on 10.185.53.161:37860
17/05/25 14:41:47 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
17/05/25 14:41:47 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.185.53.161, 37860, None)
17/05/25 14:41:47 INFO BlockManagerMasterEndpoint: Registering block manager 10.185.53.161:37860 with 4.0 GB RAM, BlockManagerId(driver, 10.185.53.161, 37860, None)
17/05/25 14:41:47 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.185.53.161, 37860, None)
17/05/25 14:41:47 INFO BlockManager: external shuffle service port = 7337
17/05/25 14:41:47 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.185.53.161, 37860, None)
17/05/25 14:41:47 INFO EventLoggingListener: Logging events to hdfs:///var/log/spark/apps/application_1495720658394_0004
17/05/25 14:41:47 INFO Utils: Using initial executors = 12, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
17/05/25 14:41:50 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(null) (10.185.52.31:57406) with ID 5
17/05/25 14:41:50 INFO ExecutorAllocationManager: New executor 5 has registered (new total is 1)
17/05/25 14:41:50 INFO BlockManagerMasterEndpoint: Registering block manager ip-10-185-52-31.eu-west-1.compute.internal:38781 with 4.0 GB RAM, BlockManagerId(5, ip-10-185-52-31.eu-west-1.compute.internal, 38781, None)
17/05/25 14:41:50 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(null) (10.185.53.45:40096) with ID 3
17/05/25 14:41:50 INFO ExecutorAllocationManager: New executor 3 has registered (new total is 2)
17/05/25 14:41:50 INFO BlockManagerMasterEndpoint: Registering block manager ip-10-185-53-45.eu-west-1.compute.internal:43702 with 4.0 GB RAM, BlockManagerId(3, ip-10-185-53-45.eu-west-1.compute.internal, 43702, None)
17/05/25 14:41:50 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(null) (10.185.53.135:42390) with ID 2
17/05/25 14:41:50 INFO ExecutorAllocationManager: New executor 2 has registered (new total is 3)
17/05/25 14:41:50 INFO BlockManagerMasterEndpoint: Registering block manager ip-10-185-53-135.eu-west-1.compute.internal:41552 with 4.0 GB RAM, BlockManagerId(2, ip-10-185-53-135.eu-west-1.compute.internal, 41552, None)
17/05/25 14:41:50 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(null) (10.185.53.10:60612) with ID 1
17/05/25 14:41:50 INFO ExecutorAllocationManager: New executor 1 has registered (new total is 4)
17/05/25 14:41:50 INFO BlockManagerMasterEndpoint: Registering block manager ip-10-185-53-10.eu-west-1.compute.internal:33391 with 4.0 GB RAM, BlockManagerId(1, ip-10-185-53-10.eu-west-1.compute.internal, 33391, None)
17/05/25 14:41:50 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(null) (10.185.53.68:57424) with ID 4
17/05/25 14:41:50 INFO ExecutorAllocationManager: New executor 4 has registered (new total is 5)
17/05/25 14:41:50 INFO BlockManagerMasterEndpoint: Registering block manager ip-10-185-53-68.eu-west-1.compute.internal:34222 with 4.0 GB RAM, BlockManagerId(4, ip-10-185-53-68.eu-west-1.compute.internal, 34222, None)
17/05/25 14:42:09 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after waiting maxRegisteredResourcesWaitingTime: 30000(ms)
17/05/25 14:42:09 INFO SharedState: Warehouse path is 'hdfs:///user/spark/warehouse'.
17/05/25 14:42:10 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
17/05/25 14:42:11 INFO CodeGenerator: Code generated in 170.416763 ms
17/05/25 14:42:11 INFO SparkContext: Starting job: collect at /home/hadoop/iface_extractions/select_fields.py:90
17/05/25 14:42:11 INFO DAGScheduler: Got job 0 (collect at /home/hadoop/iface_extractions/select_fields.py:90) with 1 output partitions
17/05/25 14:42:11 INFO DAGScheduler: Final stage: ResultStage 0 (collect at /home/hadoop/iface_extractions/select_fields.py:90)
17/05/25 14:42:11 INFO DAGScheduler: Parents of final stage: List()
17/05/25 14:42:11 INFO DAGScheduler: Missing parents: List()
17/05/25 14:42:11 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[2] at collect at /home/hadoop/iface_extractions/select_fields.py:90), which has no missing parents
17/05/25 14:42:11 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 7.5 KB, free 4.0 GB)
17/05/25 14:42:11 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 4.1 KB, free 4.0 GB)
17/05/25 14:42:11 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.185.53.161:37860 (size: 4.1 KB, free: 4.0 GB)
17/05/25 14:42:11 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:996
17/05/25 14:42:11 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[2] at collect at /home/hadoop/iface_extractions/select_fields.py:90)
17/05/25 14:42:11 INFO YarnScheduler: Adding task set 0.0 with 1 tasks
17/05/25 14:42:11 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, ip-10-185-53-135.eu-west-1.compute.internal, executor 2, partition 0, PROCESS_LOCAL, 5899 bytes)
17/05/25 14:42:11 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on ip-10-185-53-135.eu-west-1.compute.internal:41552 (size: 4.1 KB, free: 4.0 GB)
17/05/25 14:42:12 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1101 ms on ip-10-185-53-135.eu-west-1.compute.internal (executor 2) (1/1)
17/05/25 14:42:12 INFO YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool
17/05/25 14:42:12 INFO DAGScheduler: ResultStage 0 (collect at /home/hadoop/iface_extractions/select_fields.py:90) finished in 1.109 s
17/05/25 14:42:12 INFO DAGScheduler: Job 0 finished: collect at /home/hadoop/iface_extractions/select_fields.py:90, took 1.290037 s
17/05/25 14:42:12 INFO BlockManagerInfo: Removed broadcast_0_piece0 on 10.185.53.161:37860 in memory (size: 4.1 KB, free: 4.0 GB)
17/05/25 14:42:12 INFO SparkContext: Starting job: collect at /home/hadoop/iface_extractions/select_fields.py:91
17/05/25 14:42:12 INFO BlockManagerInfo: Removed broadcast_0_piece0 on ip-10-185-53-135.eu-west-1.compute.internal:41552 in memory (size: 4.1 KB, free: 4.0 GB)
17/05/25 14:42:12 INFO DAGScheduler: Got job 1 (collect at /home/hadoop/iface_extractions/select_fields.py:91) with 1 output partitions
17/05/25 14:42:12 INFO DAGScheduler: Final stage: ResultStage 1 (collect at /home/hadoop/iface_extractions/select_fields.py:91)
17/05/25 14:42:12 INFO DAGScheduler: Parents of final stage: List()
17/05/25 14:42:12 INFO DAGScheduler: Missing parents: List()
17/05/25 14:42:12 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[5] at collect at /home/hadoop/iface_extractions/select_fields.py:91), which has no missing parents
17/05/25 14:42:12 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 7.5 KB, free 4.0 GB)
17/05/25 14:42:12 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 4.1 KB, free 4.0 GB)
17/05/25 14:42:12 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 10.185.53.161:37860 (size: 4.1 KB, free: 4.0 GB)
17/05/25 14:42:12 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:996
17/05/25 14:42:12 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (MapPartitionsRDD[5] at collect at /home/hadoop/iface_extractions/select_fields.py:91)
17/05/25 14:42:12 INFO YarnScheduler: Adding task set 1.0 with 1 tasks
17/05/25 14:42:12 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, ip-10-185-53-68.eu-west-1.compute.internal, executor 4, partition 0, PROCESS_LOCAL, 5900 bytes)
17/05/25 14:42:13 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on ip-10-185-53-68.eu-west-1.compute.internal:34222 (size: 4.1 KB, free: 4.0 GB)
17/05/25 14:42:14 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 1047 ms on ip-10-185-53-68.eu-west-1.compute.internal (executor 4) (1/1)
17/05/25 14:42:14 INFO YarnScheduler: Removed TaskSet 1.0, whose tasks have all completed, from pool
17/05/25 14:42:14 INFO DAGScheduler: ResultStage 1 (collect at /home/hadoop/iface_extractions/select_fields.py:91) finished in 1.047 s
17/05/25 14:42:14 INFO DAGScheduler: Job 1 finished: collect at /home/hadoop/iface_extractions/select_fields.py:91, took 1.054768 s
17/05/25 14:42:14 INFO CodeGenerator: Code generated in 13.109425 ms
17/05/25 14:42:14 INFO CodeGenerator: Code generated in 12.568665 ms
17/05/25 14:42:14 INFO CodeGenerator: Code generated in 11.257538 ms
17/05/25 14:42:14 INFO BlockManagerInfo: Removed broadcast_1_piece0 on 10.185.53.161:37860 in memory (size: 4.1 KB, free: 4.0 GB)
17/05/25 14:42:14 INFO BlockManagerInfo: Removed broadcast_1_piece0 on ip-10-185-53-68.eu-west-1.compute.internal:34222 in memory (size: 4.1 KB, free: 4.0 GB)
17/05/25 14:42:14 INFO CodeGenerator: Code generated in 11.563958 ms
17/05/25 14:42:14 INFO CodeGenerator: Code generated in 18.189301 ms
17/05/25 14:42:14 INFO CodeGenerator: Code generated in 13.490762 ms
17/05/25 14:42:14 INFO CodeGenerator: Code generated in 15.156166 ms
17/05/25 14:42:50 INFO ExecutorAllocationManager: Request to remove executorIds: 5, 3
17/05/25 14:42:50 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 5, 3
17/05/25 14:42:50 INFO YarnClientSchedulerBackend: Actual list of executor(s) to be killed is 5, 3
17/05/25 14:42:50 INFO ExecutorAllocationManager: Removing executor 5 because it has been idle for 60 seconds (new desired total will be 4)
17/05/25 14:42:50 INFO ExecutorAllocationManager: Removing executor 3 because it has been idle for 60 seconds (new desired total will be 3)
17/05/25 14:42:50 INFO ExecutorAllocationManager: Request to remove executorIds: 1
17/05/25 14:42:50 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 1
17/05/25 14:42:50 INFO YarnClientSchedulerBackend: Actual list of executor(s) to be killed is 1
17/05/25 14:42:50 INFO ExecutorAllocationManager: Removing executor 1 because it has been idle for 60 seconds (new desired total will be 2)
17/05/25 14:42:50 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 5.
17/05/25 14:42:50 INFO DAGScheduler: Executor lost: 5 (epoch 0)
17/05/25 14:42:50 INFO BlockManagerMasterEndpoint: Trying to remove executor 5 from BlockManagerMaster.
17/05/25 14:42:50 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(5, ip-10-185-52-31.eu-west-1.compute.internal, 38781, None)
17/05/25 14:42:50 INFO BlockManagerMaster: Removed 5 successfully in removeExecutor
17/05/25 14:42:50 INFO YarnScheduler: Executor 5 on ip-10-185-52-31.eu-west-1.compute.internal killed by driver.
17/05/25 14:42:50 INFO ExecutorAllocationManager: Existing executor 5 has been removed (new total is 4)
17/05/25 14:42:51 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 1.
17/05/25 14:42:51 INFO DAGScheduler: Executor lost: 1 (epoch 0)
17/05/25 14:42:51 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
17/05/25 14:42:51 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(1, ip-10-185-53-10.eu-west-1.compute.internal, 33391, None)
17/05/25 14:42:51 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor
17/05/25 14:42:51 INFO YarnScheduler: Executor 1 on ip-10-185-53-10.eu-west-1.compute.internal killed by driver.
17/05/25 14:42:51 INFO ExecutorAllocationManager: Existing executor 1 has been removed (new total is 3)
17/05/25 14:42:51 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 3.
17/05/25 14:42:51 INFO DAGScheduler: Executor lost: 3 (epoch 0)
17/05/25 14:42:51 INFO BlockManagerMasterEndpoint: Trying to remove executor 3 from BlockManagerMaster.
17/05/25 14:42:51 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(3, ip-10-185-53-45.eu-west-1.compute.internal, 43702, None)
17/05/25 14:42:51 INFO BlockManagerMaster: Removed 3 successfully in removeExecutor
17/05/25 14:42:51 INFO YarnScheduler: Executor 3 on ip-10-185-53-45.eu-west-1.compute.internal killed by driver.
17/05/25 14:42:51 INFO ExecutorAllocationManager: Existing executor 3 has been removed (new total is 2)
17/05/25 14:43:12 INFO ExecutorAllocationManager: Request to remove executorIds: 2
17/05/25 14:43:12 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 2
17/05/25 14:43:12 INFO YarnClientSchedulerBackend: Actual list of executor(s) to be killed is 2
17/05/25 14:43:12 INFO ExecutorAllocationManager: Removing executor 2 because it has been idle for 60 seconds (new desired total will be 1)
17/05/25 14:43:13 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 2.
17/05/25 14:43:13 INFO DAGScheduler: Executor lost: 2 (epoch 0)
17/05/25 14:43:13 INFO BlockManagerMasterEndpoint: Trying to remove executor 2 from BlockManagerMaster.
17/05/25 14:43:13 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(2, ip-10-185-53-135.eu-west-1.compute.internal, 41552, None)
17/05/25 14:43:13 INFO BlockManagerMaster: Removed 2 successfully in removeExecutor
17/05/25 14:43:13 INFO YarnScheduler: Executor 2 on ip-10-185-53-135.eu-west-1.compute.internal killed by driver.
17/05/25 14:43:13 INFO ExecutorAllocationManager: Existing executor 2 has been removed (new total is 1)
17/05/25 14:43:14 INFO ExecutorAllocationManager: Request to remove executorIds: 4
17/05/25 14:43:14 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 4
17/05/25 14:43:14 INFO YarnClientSchedulerBackend: Actual list of executor(s) to be killed is 4
17/05/25 14:43:14 INFO ExecutorAllocationManager: Removing executor 4 because it has been idle for 60 seconds (new desired total will be 0)
17/05/25 14:43:17 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 4.
17/05/25 14:43:17 INFO DAGScheduler: Executor lost: 4 (epoch 0)
17/05/25 14:43:17 INFO BlockManagerMasterEndpoint: Trying to remove executor 4 from BlockManagerMaster.
17/05/25 14:43:17 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(4, ip-10-185-53-68.eu-west-1.compute.internal, 34222, None)
17/05/25 14:43:17 INFO BlockManagerMaster: Removed 4 successfully in removeExecutor
17/05/25 14:43:17 INFO YarnScheduler: Executor 4 on ip-10-185-53-68.eu-west-1.compute.internal killed by driver.
17/05/25 14:43:17 INFO ExecutorAllocationManager: Existing executor 4 has been removed (new total is 0)
My guess is that you've got Dynamic Resource Allocation enabled in your Spark configuration.
Spark provides a mechanism to dynamically adjust the resources your application occupies based on the workload. This means that your application may give resources back to the cluster if they are no longer used and request them again later when there is demand. This feature is particularly useful if multiple applications share resources in your Spark cluster.
This feature is disabled by default and available on all coarse-grained cluster managers, i.e. standalone mode, YARN mode, and Mesos coarse-grained mode.
I highlighted the relevant part that says it is disabled by default and hence I can only guess that it was enabled.
From ExecutorAllocationManager:
An agent that dynamically allocates and removes executors based on the workload.
With that said, I'd use web UI and see if spark.dynamicAllocation.enabled property is enabled or not.
There are two requirements for using this feature (Dynamic Resource Allocation). First, your application must set spark.dynamicAllocation.enabled to true. Second, you must set up an external shuffle service on each worker node in the same cluster and set spark.shuffle.service.enabled to true in your application.
This is the line that prints out the INFO message:
logInfo("Request to remove executorIds: " + executors.mkString(", "))
You can also kill executors using SparkContext.killExecutors that gives a Spark developer a way to kill executors himself.
killExecutors(executorIds: Seq[String]): Boolean Request that the cluster manager kill the specified executors.
There are two killExecutors actually and they are very helpful for demo purposes as you can easily show how executors come and go.

Resources