Optimization of Spark cluster configuration (low-configuration virtual machine cluster) - apache-spark

When I use Spark's standalone mode to process a large number of datasets, the log says:
ERROR TaskSchedulerImpl:70 - Lost executor 1 on : Executor heartbeat timed out after 381181 ms
I searched the internet, and the advice was to set a timeout parameter with spark-submit:
[hadoop@Master spark2.4.0]$ bin/spark-submit --master spark://master:7077 --conf spark.worker.timeout 10000000 --py-files id.py id.py --name id
Error message in log:
Error: Invalid argument to --conf: spark.worker.timeout
Question:
How do I set the timeout parameter?
Thanks to meniluca's answer: I had left out the equals sign in the command.
After adjusting the timeout, the log displays:
2019-12-05 19:42:27 WARN Utils:87 - Suppressing exception in finally: broken pipe (Write failed)
java.net.SocketException: broken pipe (Write failed)
2019-12-05 21:13:09 INFO SparkContext:54 - Invoking stop() from shutdown hook
Exception in thread "serve-DataFrame" java.net.SocketException: Connection reset
Suppressed: java.net.SocketException: broken pipe (Write failed)
Then I changed the SSH configuration by adding ServerAliveInterval 60 to ~/.ssh/config:
ServerAliveInterval 60
The error still exists. I then tried increasing the driver memory, but the error still exists and the log shows that the connection is dropped:
[hadoop@Master spark2.4.0]$ bin/spark-submit --master spark://master:7077 --conf spark.worker.timeout=10000000 --driver-memory 1g --py-files id.py id.py --name id
2019-12-06 10:38:49 INFO ContextCleaner:54 - Cleaned accumulator 374
Exception in thread "serve-DataFrame" java.net.SocketException: broken pipe (Write failed)
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
at org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:212)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:224)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:224)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:148)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:224)
at org.apache.spark.api.python.PythonRDD$$anonfun$serveIterator$1.apply(PythonRDD.scala:413)
at org.apache.spark.api.python.PythonRDD$$anonfun$serveIterator$1.apply(PythonRDD.scala:412)
at org.apache.spark.api.python.PythonRDD$$anonfun$6$$anonfun$apply$1.apply$mcV$sp(PythonRDD.scala:435)
at org.apache.spark.api.python.PythonRDD$$anonfun$6$$anonfun$apply$1.apply(PythonRDD.scala:435)
at org.apache.spark.api.python.PythonRDD$$anonfun$6$$anonfun$apply$1.apply(PythonRDD.scala:435)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.api.python.PythonRDD$$anonfun$6.apply(PythonRDD.scala:436)
at org.apache.spark.api.python.PythonRDD$$anonfun$6.apply(PythonRDD.scala:432)
at org.apache.spark.api.python.PythonServer$$anon$1.run(PythonRDD.scala:862)
2019-12-06 11:06:12 WARN HeartbeatReceiver:66 - Removing executor 1 with no recent heartbeats: 149103 ms exceeds timeout 120000 ms
2019-12-06 11:06:12 ERROR TaskSchedulerImpl:70 - Lost executor 1 on 219.226.109.129: Executor heartbeat timed out after 149103 ms
2019-12-06 11:06:13 INFO SparkContext:54 - Invoking stop() from shutdown hook
2019-12-06 11:06:13 INFO DAGScheduler:54 - Executor lost: 1 (epoch 6)
2019-12-06 11:06:13 WARN HeartbeatReceiver:66 - Removing executor 0 with no recent heartbeats: 155761 ms exceeds timeout 120000 ms
2019-12-06 11:06:13 ERROR TaskSchedulerImpl:70 - Lost executor 0 on 219.226.109.131: Executor heartbeat timed out after 155761 ms
2019-12-06 11:06:13 INFO StandaloneSchedulerBackend:54 - Requesting to kill executor(s) 1
2019-12-06 11:06:13 INFO BlockManagerMasterEndpoint:54 - Trying to remove executor 1 from BlockManagerMaster.
2019-12-06 11:06:13 INFO BlockManagerMasterEndpoint:54 - Removing block manager BlockManagerId(1, 219.226.109.129, 42501, None)
2019-12-06 11:06:13 INFO BlockManagerMaster:54 - Removed 1 successfully in removeExecutor
2019-12-06 11:06:13 INFO DAGScheduler:54 - Shuffle files lost for executor: 1 (epoch 6)
2019-12-06 11:06:13 INFO StandaloneSchedulerBackend:54 - Actual list of executor(s) to be killed is 1
2019-12-06 11:06:13 INFO DAGScheduler:54 - Host added was in lost list earlier: 219.226.109.129
2019-12-06 11:06:13 INFO DAGScheduler:54 - Executor lost: 0 (epoch 7)
2019-12-06 11:06:13 INFO AbstractConnector:318 - Stopped Spark@490228e{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2019-12-06 11:06:13 INFO BlockManagerMasterEndpoint:54 - Trying to remove executor 0 from BlockManagerMaster.
2019-12-06 11:06:13 INFO BlockManagerMasterEndpoint:54 - Removing block manager BlockManagerId(0, 219.226.109.131, 42164, None)
2019-12-06 11:06:13 INFO BlockManagerMaster:54 - Removed 0 successfully in removeExecutor
2019-12-06 11:06:13 INFO DAGScheduler:54 - Shuffle files lost for executor: 0 (epoch 7)
2019-12-06 11:06:13 INFO DAGScheduler:54 - Host added was in lost list earlier: 219.226.109.131
2019-12-06 11:06:13 INFO SparkUI:54 - Stopped Spark web UI at http://Master:4040
2019-12-06 11:06:13 INFO BlockManagerMasterEndpoint:54 - Registering block manager 219.226.109.129:42501 with 413.9 MB RAM, BlockManagerId(1, 219.226.109.129, 42501, None)
2019-12-06 11:06:13 INFO BlockManagerMasterEndpoint:54 - Registering block manager 219.226.109.131:42164 with 413.9 MB RAM, BlockManagerId(0, 219.226.109.131, 42164, None)
2019-12-06 11:06:14 INFO StandaloneSchedulerBackend:54 - Shutting down all executors
2019-12-06 11:06:14 INFO CoarseGrainedSchedulerBackend$DriverEndpoint:54 - Asking each executor to shut down
2019-12-06 11:06:14 INFO BlockManagerInfo:54 - Added broadcast_15_piece0 in memory on 219.226.109.129:42501 (size: 21.1 KB, free: 413.9 MB)
2019-12-06 11:06:15 INFO MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
2019-12-06 11:06:15 INFO BlockManagerInfo:54 - Added broadcast_15_piece0 in memory on 219.226.109.131:42164 (size: 21.1 KB, free: 413.9 MB)
2019-12-06 11:06:16 INFO MemoryStore:54 - MemoryStore cleared
2019-12-06 11:06:16 INFO BlockManager:54 - BlockManager stopped
2019-12-06 11:06:16 INFO BlockManagerMaster:54 - BlockManagerMaster stopped
2019-12-06 11:06:17 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
2019-12-06 11:06:17 ERROR TransportResponseHandler:144 - Still have 1 requests outstanding when connection from Master/219.226.109.130:7077 is closed
2019-12-06 11:06:17 INFO SparkContext:54 - Successfully stopped SparkContext
2019-12-06 11:06:17 INFO ShutdownHookManager:54 - Shutdown hook called
2019-12-06 11:06:17 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-e2a29bac-7277-4476-ad23-315a27e9ccf0
2019-12-06 11:06:17 INFO ShutdownHookManager:54 - Deleting directory /tmp/localPyFiles-dd95954c-2e77-41ca-969d-a201269f5b5b
2019-12-06 11:06:17 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-bcd56b4a-fb32-4b58-a1d5-71abc5218d32
2019-12-06 11:06:17 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-e2a29bac-7277-4476-ad23-315a27e9ccf0/pyspark-d04b799f-a116-44d5-b6a5-811cc8c03743
Questions:
Is SSH related to the broken pipe error?
Does increasing the driver memory help with this problem?
The configuration posts I find on the Internet all assume high-spec machines. I built my cluster from virtual machines on my own computer: the master has two cores and the slave has one core. How should I adjust the configuration?

Please try:
--conf spark.worker.timeout=10000000
You are missing the equals sign between the configuration name and its value.
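As a rough illustration of the point above, the sketch below assembles a spark-submit argument list in Python so that every setting is emitted as a single `--conf key=value` token. The helper name `build_submit_args` and the timeout values are illustrative assumptions, not taken from the question. Two further assumptions are worth checking against your own setup: the `exceeds timeout 120000 ms` message in the log matches the default of `spark.network.timeout` rather than `spark.worker.timeout`, so that is the property set here (with `spark.executor.heartbeatInterval` kept well below it), and `--name` is placed before the application file because anything after the application file is passed to the application itself.

```python
import shlex


def build_submit_args(master, app_file, conf=None, name=None, py_files=None):
    """Return a spark-submit argument list with well-formed --conf pairs."""
    args = ["bin/spark-submit", "--master", master]
    if name:
        # Options must precede the application file; later tokens go to the app.
        args += ["--name", name]
    for key, value in (conf or {}).items():
        if "=" in key or not str(value):
            raise ValueError(f"bad conf entry: {key!r}={value!r}")
        # spark-submit expects one token of the form key=value after --conf.
        args += ["--conf", f"{key}={value}"]
    if py_files:
        args += ["--py-files", ",".join(py_files)]
    args.append(app_file)
    return args


# Illustrative values only; tune them for your cluster.
args = build_submit_args(
    master="spark://master:7077",
    app_file="id.py",
    name="id",
    conf={
        "spark.network.timeout": "600s",
        "spark.executor.heartbeatInterval": "60s",
    },
)
print(" ".join(shlex.quote(a) for a in args))
```

Building the list programmatically makes it harder to drop the equals sign or to place an option after the application file by accident.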

java.net.SocketException: broken pipe (Write failed) can occur when something is wrong with the port being accessed.
I suggest changing the port used by the master, whose web UI is at port 8080 by default. The port can be changed either in the configuration file or via command-line options to the start script:
sbin/start-master.sh
The same can be tried on the worker node if the above does not fix the issue.
To see which ports are in use, you can run:
sudo netstat -ltup
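If netstat is not available, a similar check can be sketched in Python with the standard socket module. This is only a minimal probe under the assumption that the service listens on TCP and is reachable on the given host; the function name `port_in_use` is made up for the example, and 7077 and 8080 are the default standalone master RPC and web UI ports.

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


# Default standalone ports: 7077 (master RPC) and 8080 (master web UI).
for port in (7077, 8080):
    state = "in use" if port_in_use(port) else "free"
    print(f"port {port}: {state}")
```

Unlike netstat, this does not tell you which process owns the port; it only reports whether a connection attempt succeeds.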

Related

FileNotFoundException on submitting Spark Jobs to remote

I've created an environment where I've set up 3 Docker containers, 1 for Airflow using the puckel/docker-airflow image with spark and hadoop additionally installed. The other two containers are basically imitating spark master and worker (used gettyimages/spark Docker image to create this). All 3 containers are connected to each other via a bridge network, so all containers are able to communicate with each other.
What I'm trying to do next is to submit a Spark job from the Airflow container to the Spark cluster (master).
As an initial example, I'm using the wordcount sample script. I created a sample.txt file in the Airflow container at the path /usr/local/airflow/sample.txt. I opened a shell in the Airflow container and I'm using the command below to run wordcount.py on the Spark master, located at the IP address I found by inspecting the bridge network.
spark-submit --master spark://ipaddress:7077 --files usr/local/airflow/sample.txt /opt/spark-2.4.1/examples/src/main/python/wordcount.py sample.txt
After submitting the script, from the logs, I can see that a connection has been established with the master (from airflow container), and it also copied the file specified by --files to the master and worker, but then it just errors out saying,
java.io.FileNotFoundException: File file:/usr/local/airflow/sample.txt does not exist
As I understand it (I could be wrong), when we specify files to copy to the master using --files, we can access them directly by file name (sample.txt in my case). So what I'm trying to figure out is: if the job has been submitted and the file has been copied to the master, why is it searching in the location file:/usr/local/airflow/sample.txt? How do I make it refer to the correct path?
I apologize, as this question has been asked a couple of times, but I've read all the related questions on Stack Overflow and I'm still unable to resolve this. I'd really appreciate your help.
Thanks.
The full log is below:
user@machine:/usr/local/airflow# spark-submit --master spark://172.22.0.2:7077 --files sample.txt /opt/spark-2.4.1/examples/src/main/python/wordcount.py ./sample.txt
20/07/25 03:23:34 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/07/25 03:23:35 INFO SparkContext: Running Spark version 2.4.1
20/07/25 03:23:35 INFO SparkContext: Submitted application: PythonWordCount
20/07/25 03:23:35 INFO SecurityManager: Changing view acls to: root
20/07/25 03:23:35 INFO SecurityManager: Changing modify acls to: root
20/07/25 03:23:35 INFO SecurityManager: Changing view acls groups to:
20/07/25 03:23:35 INFO SecurityManager: Changing modify acls groups to:
20/07/25 03:23:35 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
20/07/25 03:23:35 INFO Utils: Successfully started service 'sparkDriver' on port 33457.
20/07/25 03:23:35 INFO SparkEnv: Registering MapOutputTracker
20/07/25 03:23:36 INFO SparkEnv: Registering BlockManagerMaster
20/07/25 03:23:36 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
20/07/25 03:23:36 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
20/07/25 03:23:36 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-dd1957de-6907-484d-a3d8-2b3b88e0c7ca
20/07/25 03:23:36 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
20/07/25 03:23:36 INFO SparkEnv: Registering OutputCommitCoordinator
20/07/25 03:23:36 INFO Utils: Successfully started service 'SparkUI' on port 4040.
20/07/25 03:23:36 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://0508a77fcaad:4040
20/07/25 03:23:37 INFO SparkContext: Added file file:///usr/local/airflow/sample.txt at spark://0508a77fcaad:33457/files/sample.txt with timestamp 1595647417081
20/07/25 03:23:37 INFO Utils: Copying /usr/local/airflow/sample.txt to /tmp/spark-f9dfe6ee-22d7-4747-beab-9450fc1afce0/userFiles-74f8cfe4-8a19-4d2e-8fa1-1f0bd1f0ef12/sample.txt
20/07/25 03:23:37 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://172.22.0.2:7077...
20/07/25 03:23:37 INFO TransportClientFactory: Successfully created connection to /172.22.0.2:7077 after 32 ms (0 ms spent in bootstraps)
20/07/25 03:23:38 INFO StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20200725032338-0003
20/07/25 03:23:38 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 45057.
20/07/25 03:23:38 INFO NettyBlockTransferService: Server created on 0508a77fcaad:45057
20/07/25 03:23:38 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/07/25 03:23:38 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20200725032338-0003/0 on worker-20200725025003-172.22.0.4-8881 (172.22.0.4:8881) with 2 core(s)
20/07/25 03:23:38 INFO StandaloneSchedulerBackend: Granted executor ID app-20200725032338-0003/0 on hostPort 172.22.0.4:8881 with 2 core(s), 1024.0 MB RAM
20/07/25 03:23:38 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 0508a77fcaad, 45057, None)
20/07/25 03:23:38 INFO BlockManagerMasterEndpoint: Registering block manager 0508a77fcaad:45057 with 366.3 MB RAM, BlockManagerId(driver, 0508a77fcaad, 45057, None)
20/07/25 03:23:38 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 0508a77fcaad, 45057, None)
20/07/25 03:23:38 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 0508a77fcaad, 45057, None)
20/07/25 03:23:38 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200725032338-0003/0 is now RUNNING
20/07/25 03:23:38 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
20/07/25 03:23:38 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/usr/local/airflow/spark-warehouse').
20/07/25 03:23:38 INFO SharedState: Warehouse path is 'file:/usr/local/airflow/spark-warehouse'.
20/07/25 03:23:40 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
20/07/25 03:23:47 INFO FileSourceStrategy: Pruning directories with:
20/07/25 03:23:47 INFO FileSourceStrategy: Post-Scan Filters:
20/07/25 03:23:47 INFO FileSourceStrategy: Output Data Schema: struct<value: string>
20/07/25 03:23:47 INFO FileSourceScanExec: Pushed Filters:
20/07/25 03:23:51 INFO CodeGenerator: Code generated in 2187.926234 ms
20/07/25 03:23:53 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 220.9 KB, free 366.1 MB)
20/07/25 03:23:55 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 20.8 KB, free 366.1 MB)
20/07/25 03:23:55 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 0508a77fcaad:45057 (size: 20.8 KB, free: 366.3 MB)
20/07/25 03:23:55 INFO SparkContext: Created broadcast 0 from javaToPython at NativeMethodAccessorImpl.java:0
20/07/25 03:23:55 INFO FileSourceScanExec: Planning scan with bin packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 bytes.
20/07/25 03:23:57 INFO SparkContext: Starting job: collect at /opt/spark-2.4.1/examples/src/main/python/wordcount.py:40
20/07/25 03:23:58 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (172.22.0.4:59324) with ID 0
20/07/25 03:23:58 INFO DAGScheduler: Registering RDD 5 (reduceByKey at /opt/spark-2.4.1/examples/src/main/python/wordcount.py:39)
20/07/25 03:23:58 INFO DAGScheduler: Got job 0 (collect at /opt/spark-2.4.1/examples/src/main/python/wordcount.py:40) with 1 output partitions
20/07/25 03:23:58 INFO DAGScheduler: Final stage: ResultStage 1 (collect at /opt/spark-2.4.1/examples/src/main/python/wordcount.py:40)
20/07/25 03:23:58 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
20/07/25 03:23:58 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
20/07/25 03:23:58 INFO DAGScheduler: Submitting ShuffleMapStage 0 (PairwiseRDD[5] at reduceByKey at /opt/spark-2.4.1/examples/src/main/python/wordcount.py:39), which has no missing parents
20/07/25 03:23:58 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 15.2 KB, free 366.0 MB)
20/07/25 03:23:58 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 9.1 KB, free 366.0 MB)
20/07/25 03:23:58 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 0508a77fcaad:45057 (size: 9.1 KB, free: 366.3 MB)
20/07/25 03:23:58 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1161
20/07/25 03:23:58 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 0 (PairwiseRDD[5] at reduceByKey at /opt/spark-2.4.1/examples/src/main/python/wordcount.py:39) (first 15 tasks are for partitions Vector(0))
20/07/25 03:23:58 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
20/07/25 03:23:58 INFO BlockManagerMasterEndpoint: Registering block manager 172.22.0.4:45435 with 366.3 MB RAM, BlockManagerId(0, 172.22.0.4, 45435, None)
20/07/25 03:23:58 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 172.22.0.4, executor 0, partition 0, PROCESS_LOCAL, 8307 bytes)
20/07/25 03:24:03 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 172.22.0.4:45435 (size: 9.1 KB, free: 366.3 MB)
20/07/25 03:24:09 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 172.22.0.4:45435 (size: 20.8 KB, free: 366.3 MB)
20/07/25 03:24:11 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 172.22.0.4, executor 0): java.io.FileNotFoundException: File file:/usr/local/airflow/sample.txt does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:153)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:148)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:224)
at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:557)
at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:345)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:194)
20/07/25 03:24:11 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID 1, 172.22.0.4, executor 0, partition 0, PROCESS_LOCAL, 8307 bytes)
20/07/25 03:24:11 INFO TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) on 172.22.0.4, executor 0: java.io.FileNotFoundException (File file:/usr/local/airflow/sample.txt does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.) [duplicate 1]
20/07/25 03:24:11 INFO TaskSetManager: Starting task 0.2 in stage 0.0 (TID 2, 172.22.0.4, executor 0, partition 0, PROCESS_LOCAL, 8307 bytes)
20/07/25 03:24:12 INFO TaskSetManager: Lost task 0.2 in stage 0.0 (TID 2) on 172.22.0.4, executor 0: java.io.FileNotFoundException (File file:/usr/local/airflow/sample.txt does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.) [duplicate 2]
20/07/25 03:24:12 INFO TaskSetManager: Starting task 0.3 in stage 0.0 (TID 3, 172.22.0.4, executor 0, partition 0, PROCESS_LOCAL, 8307 bytes)
20/07/25 03:24:12 INFO TaskSetManager: Lost task 0.3 in stage 0.0 (TID 3) on 172.22.0.4, executor 0: java.io.FileNotFoundException (File file:/usr/local/airflow/sample.txt does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.) [duplicate 3]
20/07/25 03:24:12 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
20/07/25 03:24:12 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
20/07/25 03:24:12 INFO TaskSchedulerImpl: Cancelling stage 0
20/07/25 03:24:12 INFO TaskSchedulerImpl: Killing all running tasks in stage 0: Stage cancelled
20/07/25 03:24:12 INFO DAGScheduler: ShuffleMapStage 0 (reduceByKey at /opt/spark-2.4.1/examples/src/main/python/wordcount.py:39) failed in 13.690 s due to Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 172.22.0.4, executor 0): java.io.FileNotFoundException: File file:/usr/local/airflow/sample.txt does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:153)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:148)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:224)
at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:557)
at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:345)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:194)
Driver stacktrace:
20/07/25 03:24:12 INFO DAGScheduler: Job 0 failed: collect at /opt/spark-2.4.1/examples/src/main/python/wordcount.py:40, took 14.579961 s
Traceback (most recent call last):
File "/opt/spark-2.4.1/examples/src/main/python/wordcount.py", line 40, in <module>
output = counts.collect()
File "/opt/spark-2.4.1/python/lib/pyspark.zip/pyspark/rdd.py", line 816, in collect
File "/opt/spark-2.4.1/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/opt/spark-2.4.1/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/opt/spark-2.4.1/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 172.22.0.4, executor 0): java.io.FileNotFoundException: File file:/usr/local/airflow/sample.txt does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:153)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:148)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:224)
at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:557)
at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:345)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:194)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:945)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.collect(RDD.scala:944)
at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.FileNotFoundException: File file:/usr/local/airflow/sample.txt does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:153)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:148)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:224)
at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:557)
at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:345)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:194)
20/07/25 03:24:13 INFO SparkContext: Invoking stop() from shutdown hook
20/07/25 03:24:13 INFO SparkUI: Stopped Spark web UI at http://0508a77fcaad:4040
20/07/25 03:24:13 INFO StandaloneSchedulerBackend: Shutting down all executors
20/07/25 03:24:13 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
20/07/25 03:24:16 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
20/07/25 03:24:16 INFO MemoryStore: MemoryStore cleared
20/07/25 03:24:16 INFO BlockManager: BlockManager stopped
20/07/25 03:24:16 INFO BlockManagerMaster: BlockManagerMaster stopped
20/07/25 03:24:16 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
20/07/25 03:24:16 INFO SparkContext: Successfully stopped SparkContext
20/07/25 03:24:16 INFO ShutdownHookManager: Shutdown hook called
20/07/25 03:24:16 INFO ShutdownHookManager: Deleting directory /tmp/spark-2dfb2222-d56c-4ee1-ab62-86e71e5e751b
20/07/25 03:24:16 INFO ShutdownHookManager: Deleting directory /tmp/spark-f9dfe6ee-22d7-4747-beab-9450fc1afce0
20/07/25 03:24:16 INFO ShutdownHookManager: Deleting directory /tmp/spark-f9dfe6ee-22d7-4747-beab-9450fc1afce0/pyspark-2ee74d07-6606-4edc-8420-fe46212c50e5
Change your spark-submit command as shown below to submit your Spark job.
# --files ships sample.txt with the job, so wordcount.py can be given just the file name
spark-submit \
--master spark://ipaddress:7077 \
--deploy-mode cluster \
--files /usr/local/airflow/sample.txt \
/opt/spark-2.4.1/examples/src/main/python/wordcount.py sample.txt
OR
spark-submit \
--master spark://ipaddress:7077 \
/opt/spark-2.4.1/examples/src/main/python/wordcount.py /usr/local/airflow/sample.txt

Spark Pod restarting every hour in Kubernetes

I have deployed Spark applications in cluster mode on Kubernetes. The Spark application pod gets restarted almost every hour.
The driver log has this message before the restart:
20/07/11 13:34:02 ERROR TaskSchedulerImpl: Lost executor 1 on x.x.x.x: The executor with id 1 was deleted by a user or the framework.
20/07/11 13:34:02 ERROR TaskSchedulerImpl: Lost executor 2 on y.y.y.y: The executor with id 2 was deleted by a user or the framework.
20/07/11 13:34:02 INFO DAGScheduler: Executor lost: 1 (epoch 0)
20/07/11 13:34:02 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
20/07/11 13:34:02 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(1, x.x.x.x, 44879, None)
20/07/11 13:34:02 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor
20/07/11 13:34:02 INFO DAGScheduler: Shuffle files lost for executor: 1 (epoch 0)
20/07/11 13:34:02 INFO DAGScheduler: Executor lost: 2 (epoch 1)
20/07/11 13:34:02 INFO BlockManagerMasterEndpoint: Trying to remove executor 2 from BlockManagerMaster.
20/07/11 13:34:02 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(2, y.y.y.y, 46191, None)
20/07/11 13:34:02 INFO BlockManagerMaster: Removed 2 successfully in removeExecutor
20/07/11 13:34:02 INFO DAGScheduler: Shuffle files lost for executor: 2 (epoch 1)
20/07/11 13:34:02 INFO ExecutorPodsAllocator: Going to request 2 executors from Kubernetes.
20/07/11 13:34:16 INFO ExecutorPodsAllocator: Going to request 2 executors from Kubernetes.
And the Executor log has:
20/07/11 15:55:01 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
20/07/11 15:55:01 INFO MemoryStore: MemoryStore cleared
20/07/11 15:55:01 INFO BlockManager: BlockManager stopped
20/07/11 15:55:01 INFO ShutdownHookManager: Shutdown hook called
How can I find out what is causing the executors to be deleted?
Deployment:
Replicas: 1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 1 max unavailable, 0 max surge
Pod Template:
Labels: app=test
chart=test-2.0.0
heritage=Tiller
product=testp
release=test
service=test-spark
Containers:
test-spark:
Image: test-spark:2df66df06c
Port: <none>
Host Port: <none>
Command:
/spark/bin/start-spark.sh
Args:
while true; do sleep 30; done;
Limits:
memory: 4Gi
Requests:
memory: 4Gi
Liveness: exec [/spark/bin/liveness-probe.sh] delay=300s timeout=1s period=30s #success=1 #failure=10
Environment:
JVM_ARGS: -Xms256m -Xmx1g
KUBERNETES_MASTER: https://kubernetes.default.svc
KUBERNETES_NAMESPACE: test-spark
IMAGE_PULL_POLICY: Always
DRIVER_CPU: 1
DRIVER_MEMORY: 2048m
EXECUTOR_CPU: 1
EXECUTOR_MEMORY: 2048m
EXECUTOR_INSTANCES: 2
KAFKA_ADVERTISED_HOST_NAME: kafka.default:9092
ENRICH_KAFKA_ENRICHED_EVENTS_TOPICS: test-events
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing True NewReplicaSetAvailable
OldReplicaSets: <none>
NewReplicaSet: test-spark-5c5997b459 (1/1 replicas created)
Events: <none>
I did some quick research on running Spark on Kubernetes, and it seems that Spark, by design, terminates the executor pods when the Spark application finishes running. Quoting the official Spark documentation:
When the application completes, the executor pods terminate and are cleaned up, but the driver pod persists logs and remains in “completed” state in the Kubernetes API until it’s eventually garbage collected or manually cleaned up.
Therefore, I believe the restarts are nothing to worry about, as long as your Spark instance still manages to start executor pods as and when required.
Reference: https://spark.apache.org/docs/2.4.5/running-on-kubernetes.html#how-it-works
I don't know how you configured your application pod, but you can stop the pod from restarting by including the following in your deployment YAML file. The pod will then never restart, and you can debug it from there.
restartPolicy: Never
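As a first step toward finding the cause, it can help to pull every "Lost executor" line out of the driver log and look at the reason the driver recorded next to each executor id. The following is only a minimal sketch in plain Python: the function name lost_executors and the sample lines (copied from the driver log in the question) are illustrative, and the regular expression assumes the log keeps the "Lost executor <id> on <host>: <reason>" layout shown above.

```python
import re

# Matches driver lines such as:
# "... ERROR TaskSchedulerImpl: Lost executor 1 on x.x.x.x: <reason>"
LOST_EXECUTOR = re.compile(r"Lost executor (\S+) on ([^:\s]*): (.*)$")


def lost_executors(log_lines):
    """Return (executor_id, host, reason) for every 'Lost executor' line."""
    found = []
    for line in log_lines:
        match = LOST_EXECUTOR.search(line)
        if match:
            executor_id, host, reason = match.groups()
            found.append((executor_id, host, reason.strip()))
    return found


# Sample lines taken from the driver log in the question.
sample_log = [
    "20/07/11 13:34:02 ERROR TaskSchedulerImpl: Lost executor 1 on x.x.x.x: "
    "The executor with id 1 was deleted by a user or the framework.",
    "20/07/11 13:34:02 INFO DAGScheduler: Executor lost: 1 (epoch 0)",
    "20/07/11 13:34:02 ERROR TaskSchedulerImpl: Lost executor 2 on y.y.y.y: "
    "The executor with id 2 was deleted by a user or the framework.",
]

lost = lost_executors(sample_log)
for executor_id, host, reason in lost:
    print(executor_id, host, reason)
```

Here both executors carry the same reason text, so the driver itself only knows that the pods were deleted from outside; the next place to look would be whatever deleted them, not the Spark job.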

Spark executor lost when increasing the number of executor instances

My Hadoop cluster currently has 4 nodes and 45 cores, running PySpark 2.4 on YARN. When I run spark-submit with one executor everything works fine, but if I change executor-instances to 3 or 4, the executors are killed by the driver and only one task is working.
I have changed the following settings in Cloudera Manager:
yarn.nodemanager.resource.memory-mb: 64 GB
yarn.nodemanager.resource.cpu-vcores: 45
And below is the log that I get:
19/03/21 11:28:48 INFO cluster.YarnScheduler: Adding task set 0.0 with 1 tasks
19/03/21 11:28:48 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, datanode1, executor 2, partition 0, PROCESS_LOCAL, 7701 bytes)
19/03/21 11:28:48 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on datanode1:42432 (size: 71.0 KB, free: 366.2 MB)
19/03/21 11:29:43 INFO spark.ExecutorAllocationManager: Request to remove executorIds: 1, 3
19/03/21 11:29:43 INFO cluster.YarnClientSchedulerBackend: Requesting to kill executor(s) 1, 3
19/03/21 11:29:43 INFO cluster.YarnClientSchedulerBackend: Actual list of executor(s) to be killed is 1, 3
19/03/21 11:29:43 INFO spark.ExecutorAllocationManager: Removing executor 1 because it has been idle for 60 seconds (new desired total will be 2)
19/03/21 11:29:43 INFO spark.ExecutorAllocationManager: Removing executor 3 because it has been idle for 60 seconds (new desired total will be 1)
19/03/21 11:29:45 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 3.
19/03/21 11:29:45 INFO scheduler.DAGScheduler: Executor lost: 3 (epoch 0)
19/03/21 11:29:45 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 3 from BlockManagerMaster.
19/03/21 11:29:45 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(3, datanode2, 32853, None)
19/03/21 11:29:45 INFO storage.BlockManagerMaster: Removed 3 successfully in removeExecutor
19/03/21 11:29:45 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 1.
19/03/21 11:29:45 INFO scheduler.DAGScheduler: Executor lost: 1 (epoch 0)
19/03/21 11:29:45 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
19/03/21 11:29:45 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(1, datanode3, 39466, None)
19/03/21 11:29:45 INFO storage.BlockManagerMaster: Removed 1 successfully in removeExecutor
19/03/21 11:29:45 INFO cluster.YarnScheduler: Executor 3 on datanode2 killed by driver.
19/03/21 11:29:45 INFO cluster.YarnScheduler: Executor 1 on datanode3 killed by driver.
19/03/21 11:29:45 INFO spark.ExecutorAllocationManager: Existing executor 3 has been removed (new total is 2)
19/03/21 11:29:45 INFO spark.ExecutorAllocationManager: Existing executor 1 has been removed (new total is 1)
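The ExecutorAllocationManager lines in this log come from dynamic allocation: the task set has only 1 task, so executors 1 and 3 have no work and are removed after being idle for 60 seconds. If the goal is to keep a fixed number of executors alive (an assumption about what is wanted here), one option is to disable dynamic allocation when submitting. A minimal sketch that builds such a spark-submit argument list; the helper name fixed_executor_args and the script name my_job.py are hypothetical, while spark.dynamicAllocation.enabled and --num-executors are standard Spark settings.

```python
def fixed_executor_args(script, num_executors):
    """Build a spark-submit argument list that requests a fixed executor count on YARN."""
    return [
        "spark-submit",
        "--master", "yarn",
        # Turn off dynamic allocation so idle executors are not released.
        "--conf", "spark.dynamicAllocation.enabled=false",
        "--num-executors", str(num_executors),
        script,
    ]


# my_job.py is a placeholder for the real application script.
args = fixed_executor_args("my_job.py", 3)
print(" ".join(args))
```

This only keeps the executors from being released. A stage with a single task still runs on a single executor, so the input would also need more than one partition before the extra executors get any work.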

Collect failed in ... s due to Stage cancelled because SparkContext was shut down

I want to display the number of elements in each partition, so I wrote the following:
def count_in_a_partition(iterator):
    yield sum(1 for _ in iterator)
If I use it like this:
print("number of element in each partitions: {}".format(
    my_rdd.mapPartitions(count_in_a_partition).collect()
))
I get the following:
19/02/18 21:41:15 INFO DAGScheduler: Job 3 failed: collect at /project/6008168/tamouze/testSparkCedar.py:435, took 30.859710 s
19/02/18 21:41:15 INFO DAGScheduler: ResultStage 3 (collect at /project/6008168/tamouze/testSparkCedar.py:435) failed in 30.848 s due to Stage cancelled because SparkContext was shut down
19/02/18 21:41:15 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
19/02/18 21:41:16 INFO MemoryStore: MemoryStore cleared
19/02/18 21:41:16 INFO BlockManager: BlockManager stopped
19/02/18 21:41:16 INFO BlockManagerMaster: BlockManagerMaster stopped
19/02/18 21:41:16 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
19/02/18 21:41:16 WARN BlockManager: Putting block rdd_3_14 failed due to exception java.net.SocketException: Connection reset.
19/02/18 21:41:16 WARN BlockManager: Block rdd_3_14 could not be removed as it was not found on disk or in memory
19/02/18 21:41:16 WARN BlockManager: Putting block rdd_3_3 failed due to exception java.net.SocketException: Connection reset.
19/02/18 21:41:16 WARN BlockManager: Block rdd_3_3 could not be removed as it was not found on disk or in memory
19/02/18 21:41:16 INFO SparkContext: Successfully stopped SparkContext
....
Note that my_rdd.take(1) returns:
[(u'id', u'text', array([-0.31921682, ...,0.890875]))]
How can I solve this issue?
You can use the glom() function for that. Let’s take an example.
Let's create a DataFrame first.
rdd=sc.parallelize([('a',22),('b',1),('c',4),('b',1),('d',2),('e',0),('d',3),('a',1),('c',4),('b',7),('a',2),('f',1)] )
df=rdd.toDF(['key','value'])
df=df.repartition(5,"key") # Make 5 Partitions
The number of partitions -
print("Number of partitions: {}".format(df.rdd.getNumPartitions()))
Number of partitions: 5
Number of rows/elements on each partition. This can give you an idea of skew -
print('Partitioning distribution: '+ str(df.rdd.glom().map(len).collect()))
Partitioning distribution: [3, 3, 2, 2, 2]
See how the rows are actually distributed across the partitions. Beware that if the dataset is big, this can crash your system with an out-of-memory (OOM) error, because collect() brings every row to the driver.
print("Partitions structure: {}".format(df.rdd.glom().collect()))
Partitions structure: [
#Partition 1 [Row(key='a', value=22), Row(key='a', value=1), Row(key='a', value=2)],
#Partition 2 [Row(key='b', value=1), Row(key='b', value=1), Row(key='b', value=7)],
#Partition 3 [Row(key='c', value=4), Row(key='c', value=4)],
#Partition 4 [Row(key='e', value=0), Row(key='f', value=1)],
#Partition 5 [Row(key='d', value=2), Row(key='d', value=3)]
]
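For comparison, the count_in_a_partition generator from the question computes the same per-partition sizes as glom().map(len), but it walks each partition's iterator instead of first turning the partition into a list. The sketch below checks that on plain Python lists; simulate_map_partitions is a hypothetical stand-in for RDD.mapPartitions(...).collect() and no Spark is involved, so it says nothing about the SparkContext shutdown in the question's log.

```python
def count_in_a_partition(iterator):
    # Consume the partition lazily and emit a single count for it.
    yield sum(1 for _ in iterator)


def simulate_map_partitions(partitions, func):
    """Apply func to each partition's iterator and flatten the results,
    mimicking RDD.mapPartitions(func).collect() on plain Python lists."""
    results = []
    for partition in partitions:
        results.extend(func(iter(partition)))
    return results


# Stand-in for the 5 partitions shown above (plain lists, not a Spark RDD).
partitions = [
    [("a", 22), ("a", 1), ("a", 2)],
    [("b", 1), ("b", 1), ("b", 7)],
    [("c", 4), ("c", 4)],
    [("e", 0), ("f", 1)],
    [("d", 2), ("d", 3)],
]

streamed_counts = simulate_map_partitions(partitions, count_in_a_partition)
# What glom().map(len) computes: materialise each partition, then take its length.
glom_counts = [len(list(partition)) for partition in partitions]

print(streamed_counts)  # [3, 3, 2, 2, 2]
```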

Why would Spark executors be removed (with "ExecutorAllocationManager: Request to remove executorIds" in the logs)?

I'm trying to execute a Spark job on an AWS cluster of 6 c4.2xlarge nodes and I don't know why Spark is killing the executors...
Any help will be appreciated.
Here is the spark-submit command:
. /usr/bin/spark-submit --packages="com.databricks:spark-avro_2.11:3.2.0" --jars RedshiftJDBC42-1.2.1.1001.jar --deploy-mode client --master yarn --num-executors 12 --executor-cores 3 --executor-memory 7G --driver-memory 7g --py-files dependencies.zip iface_extractions.py 2016-10-01 > output.log
At this line it starts to remove executors:
17/05/25 14:42:50 INFO ExecutorAllocationManager: Request to remove executorIds: 5, 3
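In the full log that follows, executors 5 and 3 register at 14:41:50, the two jobs run on executors 2 and 4, and the removal request for 5 and 3 arrives at 14:42:50. That gap matches the 60-second default of spark.dynamicAllocation.executorIdleTimeout, which suggests dynamic allocation is releasing idle executors rather than killing failed ones. A small sketch of that arithmetic in plain Python; the helper name seconds_between is illustrative, and it assumes each log line starts with a "yy/MM/dd HH:mm:ss" timestamp as in this log.

```python
from datetime import datetime

LOG_TIME_FORMAT = "%y/%m/%d %H:%M:%S"


def seconds_between(earlier_line, later_line):
    """Return the seconds between the timestamps that start two Spark log lines."""
    def parse(line):
        # The timestamp is the first 17 characters, e.g. "17/05/25 14:41:50".
        return datetime.strptime(line[:17], LOG_TIME_FORMAT)

    return (parse(later_line) - parse(earlier_line)).total_seconds()


# Both lines are copied from the spark-submit log in the question.
registered = ("17/05/25 14:41:50 INFO ExecutorAllocationManager: "
              "New executor 5 has registered (new total is 1)")
removal = ("17/05/25 14:42:50 INFO ExecutorAllocationManager: "
           "Request to remove executorIds: 5, 3")

idle_seconds = seconds_between(registered, removal)
print(idle_seconds)  # 60.0
```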
Output spark-submit log:
Ivy Default Cache set to: /home/hadoop/.ivy2/cache
The jars for the packages stored in: /home/hadoop/.ivy2/jars
:: loading settings :: url = jar:file:/usr/lib/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-avro_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found com.databricks#spark-avro_2.11;3.2.0 in central
found org.slf4j#slf4j-api;1.7.5 in central
found org.apache.avro#avro;1.7.6 in central
found org.codehaus.jackson#jackson-core-asl;1.9.13 in central
found org.codehaus.jackson#jackson-mapper-asl;1.9.13 in central
found com.thoughtworks.paranamer#paranamer;2.3 in central
found org.xerial.snappy#snappy-java;1.0.5 in central
found org.apache.commons#commons-compress;1.4.1 in central
found org.tukaani#xz;1.0 in central
:: resolution report :: resolve 284ms :: artifacts dl 8ms
:: modules in use:
com.databricks#spark-avro_2.11;3.2.0 from central in [default]
com.thoughtworks.paranamer#paranamer;2.3 from central in [default]
org.apache.avro#avro;1.7.6 from central in [default]
org.apache.commons#commons-compress;1.4.1 from central in [default]
org.codehaus.jackson#jackson-core-asl;1.9.13 from central in [default]
org.codehaus.jackson#jackson-mapper-asl;1.9.13 from central in [default]
org.slf4j#slf4j-api;1.7.5 from central in [default]
org.tukaani#xz;1.0 from central in [default]
org.xerial.snappy#snappy-java;1.0.5 from central in [default]
:: evicted modules:
org.slf4j#slf4j-api;1.6.4 by [org.slf4j#slf4j-api;1.7.5] in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 10 | 0 | 0 | 1 || 9 | 0 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
confs: [default]
0 artifacts copied, 9 already retrieved (0kB/8ms)
17/05/25 14:41:37 INFO SparkContext: Running Spark version 2.1.0
17/05/25 14:41:38 INFO SecurityManager: Changing view acls to: hadoop
17/05/25 14:41:38 INFO SecurityManager: Changing modify acls to: hadoop
17/05/25 14:41:38 INFO SecurityManager: Changing view acls groups to:
17/05/25 14:41:38 INFO SecurityManager: Changing modify acls groups to:
17/05/25 14:41:38 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); groups with view permissions: Set(); users with modify permissions: Set(hadoop); groups with modify permissions: Set()
17/05/25 14:41:38 INFO Utils: Successfully started service 'sparkDriver' on port 37132.
17/05/25 14:41:38 INFO SparkEnv: Registering MapOutputTracker
17/05/25 14:41:38 INFO SparkEnv: Registering BlockManagerMaster
17/05/25 14:41:38 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/05/25 14:41:38 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/05/25 14:41:38 INFO DiskBlockManager: Created local directory at /mnt/tmp/blockmgr-e368a261-c1a1-49e7-8533-8081896a45e4
17/05/25 14:41:38 INFO MemoryStore: MemoryStore started with capacity 4.0 GB
17/05/25 14:41:38 INFO SparkEnv: Registering OutputCommitCoordinator
17/05/25 14:41:39 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/05/25 14:41:39 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.185.53.161:4040
17/05/25 14:41:39 INFO Utils: Using initial executors = 12, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
17/05/25 14:41:39 INFO RMProxy: Connecting to ResourceManager at ip-10-185-53-161.eu-west-1.compute.internal/10.185.53.161:8032
17/05/25 14:41:39 INFO Client: Requesting a new application from cluster with 5 NodeManagers
17/05/25 14:41:40 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (11520 MB per container)
17/05/25 14:41:40 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
17/05/25 14:41:40 INFO Client: Setting up container launch context for our AM
17/05/25 14:41:40 INFO Client: Setting up the launch environment for our AM container
17/05/25 14:41:40 INFO Client: Preparing resources for our AM container
17/05/25 14:41:40 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
17/05/25 14:41:42 INFO Client: Uploading resource file:/mnt/tmp/spark-4f534fa1-c377-4113-9c86-96d5cdab4cb5/__spark_libs__6500399427935716229.zip -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/__spark_libs__6500399427935716229.zip
17/05/25 14:41:43 INFO Client: Uploading resource file:/home/hadoop/RedshiftJDBC42-1.2.1.1001.jar -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/RedshiftJDBC42-1.2.1.1001.jar
17/05/25 14:41:43 INFO Client: Uploading resource file:/home/hadoop/.ivy2/jars/com.databricks_spark-avro_2.11-3.2.0.jar -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/com.databricks_spark-avro_2.11-3.2.0.jar
17/05/25 14:41:43 INFO Client: Uploading resource file:/home/hadoop/.ivy2/jars/org.slf4j_slf4j-api-1.7.5.jar -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/org.slf4j_slf4j-api-1.7.5.jar
17/05/25 14:41:43 INFO Client: Uploading resource file:/home/hadoop/.ivy2/jars/org.apache.avro_avro-1.7.6.jar -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/org.apache.avro_avro-1.7.6.jar
17/05/25 14:41:43 INFO Client: Uploading resource file:/home/hadoop/.ivy2/jars/org.codehaus.jackson_jackson-core-asl-1.9.13.jar -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/org.codehaus.jackson_jackson-core-asl-1.9.13.jar
17/05/25 14:41:43 INFO Client: Uploading resource file:/home/hadoop/.ivy2/jars/org.codehaus.jackson_jackson-mapper-asl-1.9.13.jar -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/org.codehaus.jackson_jackson-mapper-asl-1.9.13.jar
17/05/25 14:41:43 INFO Client: Uploading resource file:/home/hadoop/.ivy2/jars/com.thoughtworks.paranamer_paranamer-2.3.jar -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/com.thoughtworks.paranamer_paranamer-2.3.jar
17/05/25 14:41:43 INFO Client: Uploading resource file:/home/hadoop/.ivy2/jars/org.xerial.snappy_snappy-java-1.0.5.jar -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/org.xerial.snappy_snappy-java-1.0.5.jar
17/05/25 14:41:43 INFO Client: Uploading resource file:/home/hadoop/.ivy2/jars/org.apache.commons_commons-compress-1.4.1.jar -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/org.apache.commons_commons-compress-1.4.1.jar
17/05/25 14:41:43 INFO Client: Uploading resource file:/home/hadoop/.ivy2/jars/org.tukaani_xz-1.0.jar -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/org.tukaani_xz-1.0.jar
17/05/25 14:41:43 INFO Client: Uploading resource file:/etc/spark/conf/hive-site.xml -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/hive-site.xml
17/05/25 14:41:43 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/pyspark.zip
17/05/25 14:41:43 INFO Client: Uploading resource file:/usr/lib/spark/python/lib/py4j-0.10.4-src.zip -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/py4j-0.10.4-src.zip
17/05/25 14:41:43 INFO Client: Uploading resource file:/home/hadoop/dependencies.zip -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/dependencies.zip
17/05/25 14:41:43 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/com.databricks_spark-avro_2.11-3.2.0.jar added multiple times to distributed cache.
17/05/25 14:41:43 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.slf4j_slf4j-api-1.7.5.jar added multiple times to distributed cache.
17/05/25 14:41:43 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.apache.avro_avro-1.7.6.jar added multiple times to distributed cache.
17/05/25 14:41:43 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.codehaus.jackson_jackson-core-asl-1.9.13.jar added multiple times to distributed cache.
17/05/25 14:41:43 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.codehaus.jackson_jackson-mapper-asl-1.9.13.jar added multiple times to distributed cache.
17/05/25 14:41:43 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/com.thoughtworks.paranamer_paranamer-2.3.jar added multiple times to distributed cache.
17/05/25 14:41:43 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.xerial.snappy_snappy-java-1.0.5.jar added multiple times to distributed cache.
17/05/25 14:41:43 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.apache.commons_commons-compress-1.4.1.jar added multiple times to distributed cache.
17/05/25 14:41:43 WARN Client: Same path resource file:/home/hadoop/.ivy2/jars/org.tukaani_xz-1.0.jar added multiple times to distributed cache.
17/05/25 14:41:43 INFO Client: Uploading resource file:/mnt/tmp/spark-4f534fa1-c377-4113-9c86-96d5cdab4cb5/__spark_conf__1516567354161750682.zip -> hdfs://ip-10-185-53-161.eu-west-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1495720658394_0004/__spark_conf__.zip
17/05/25 14:41:43 INFO SecurityManager: Changing view acls to: hadoop
17/05/25 14:41:43 INFO SecurityManager: Changing modify acls to: hadoop
17/05/25 14:41:43 INFO SecurityManager: Changing view acls groups to:
17/05/25 14:41:43 INFO SecurityManager: Changing modify acls groups to:
17/05/25 14:41:43 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); groups with view permissions: Set(); users with modify permissions: Set(hadoop); groups with modify permissions: Set()
17/05/25 14:41:43 INFO Client: Submitting application application_1495720658394_0004 to ResourceManager
17/05/25 14:41:43 INFO YarnClientImpl: Submitted application application_1495720658394_0004
17/05/25 14:41:43 INFO SchedulerExtensionServices: Starting Yarn extension services with app application_1495720658394_0004 and attemptId None
17/05/25 14:41:44 INFO Client: Application report for application_1495720658394_0004 (state: ACCEPTED)
17/05/25 14:41:44 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1495723303463
final status: UNDEFINED
tracking URL: http://ip-10-185-53-161.eu-west-1.compute.internal:20888/proxy/application_1495720658394_0004/
user: hadoop
17/05/25 14:41:45 INFO Client: Application report for application_1495720658394_0004 (state: ACCEPTED)
17/05/25 14:41:46 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as NettyRpcEndpointRef(null)
17/05/25 14:41:46 INFO Client: Application report for application_1495720658394_0004 (state: ACCEPTED)
17/05/25 14:41:46 INFO YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> ip-10-185-53-161.eu-west-1.compute.internal, PROXY_URI_BASES -> http://ip-10-185-53-161.eu-west-1.compute.internal:20888/proxy/application_1495720658394_0004), /proxy/application_1495720658394_0004
17/05/25 14:41:46 INFO JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
17/05/25 14:41:47 INFO Client: Application report for application_1495720658394_0004 (state: RUNNING)
17/05/25 14:41:47 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: 10.185.52.31
ApplicationMaster RPC port: 0
queue: default
start time: 1495723303463
final status: UNDEFINED
tracking URL: http://ip-10-185-53-161.eu-west-1.compute.internal:20888/proxy/application_1495720658394_0004/
user: hadoop
17/05/25 14:41:47 INFO YarnClientSchedulerBackend: Application application_1495720658394_0004 has started running.
17/05/25 14:41:47 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 37860.
17/05/25 14:41:47 INFO NettyBlockTransferService: Server created on 10.185.53.161:37860
17/05/25 14:41:47 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
17/05/25 14:41:47 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.185.53.161, 37860, None)
17/05/25 14:41:47 INFO BlockManagerMasterEndpoint: Registering block manager 10.185.53.161:37860 with 4.0 GB RAM, BlockManagerId(driver, 10.185.53.161, 37860, None)
17/05/25 14:41:47 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.185.53.161, 37860, None)
17/05/25 14:41:47 INFO BlockManager: external shuffle service port = 7337
17/05/25 14:41:47 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.185.53.161, 37860, None)
17/05/25 14:41:47 INFO EventLoggingListener: Logging events to hdfs:///var/log/spark/apps/application_1495720658394_0004
17/05/25 14:41:47 INFO Utils: Using initial executors = 12, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
17/05/25 14:41:50 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(null) (10.185.52.31:57406) with ID 5
17/05/25 14:41:50 INFO ExecutorAllocationManager: New executor 5 has registered (new total is 1)
17/05/25 14:41:50 INFO BlockManagerMasterEndpoint: Registering block manager ip-10-185-52-31.eu-west-1.compute.internal:38781 with 4.0 GB RAM, BlockManagerId(5, ip-10-185-52-31.eu-west-1.compute.internal, 38781, None)
17/05/25 14:41:50 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(null) (10.185.53.45:40096) with ID 3
17/05/25 14:41:50 INFO ExecutorAllocationManager: New executor 3 has registered (new total is 2)
17/05/25 14:41:50 INFO BlockManagerMasterEndpoint: Registering block manager ip-10-185-53-45.eu-west-1.compute.internal:43702 with 4.0 GB RAM, BlockManagerId(3, ip-10-185-53-45.eu-west-1.compute.internal, 43702, None)
17/05/25 14:41:50 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(null) (10.185.53.135:42390) with ID 2
17/05/25 14:41:50 INFO ExecutorAllocationManager: New executor 2 has registered (new total is 3)
17/05/25 14:41:50 INFO BlockManagerMasterEndpoint: Registering block manager ip-10-185-53-135.eu-west-1.compute.internal:41552 with 4.0 GB RAM, BlockManagerId(2, ip-10-185-53-135.eu-west-1.compute.internal, 41552, None)
17/05/25 14:41:50 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(null) (10.185.53.10:60612) with ID 1
17/05/25 14:41:50 INFO ExecutorAllocationManager: New executor 1 has registered (new total is 4)
17/05/25 14:41:50 INFO BlockManagerMasterEndpoint: Registering block manager ip-10-185-53-10.eu-west-1.compute.internal:33391 with 4.0 GB RAM, BlockManagerId(1, ip-10-185-53-10.eu-west-1.compute.internal, 33391, None)
17/05/25 14:41:50 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(null) (10.185.53.68:57424) with ID 4
17/05/25 14:41:50 INFO ExecutorAllocationManager: New executor 4 has registered (new total is 5)
17/05/25 14:41:50 INFO BlockManagerMasterEndpoint: Registering block manager ip-10-185-53-68.eu-west-1.compute.internal:34222 with 4.0 GB RAM, BlockManagerId(4, ip-10-185-53-68.eu-west-1.compute.internal, 34222, None)
17/05/25 14:42:09 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after waiting maxRegisteredResourcesWaitingTime: 30000(ms)
17/05/25 14:42:09 INFO SharedState: Warehouse path is 'hdfs:///user/spark/warehouse'.
17/05/25 14:42:10 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
17/05/25 14:42:11 INFO CodeGenerator: Code generated in 170.416763 ms
17/05/25 14:42:11 INFO SparkContext: Starting job: collect at /home/hadoop/iface_extractions/select_fields.py:90
17/05/25 14:42:11 INFO DAGScheduler: Got job 0 (collect at /home/hadoop/iface_extractions/select_fields.py:90) with 1 output partitions
17/05/25 14:42:11 INFO DAGScheduler: Final stage: ResultStage 0 (collect at /home/hadoop/iface_extractions/select_fields.py:90)
17/05/25 14:42:11 INFO DAGScheduler: Parents of final stage: List()
17/05/25 14:42:11 INFO DAGScheduler: Missing parents: List()
17/05/25 14:42:11 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[2] at collect at /home/hadoop/iface_extractions/select_fields.py:90), which has no missing parents
17/05/25 14:42:11 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 7.5 KB, free 4.0 GB)
17/05/25 14:42:11 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 4.1 KB, free 4.0 GB)
17/05/25 14:42:11 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.185.53.161:37860 (size: 4.1 KB, free: 4.0 GB)
17/05/25 14:42:11 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:996
17/05/25 14:42:11 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[2] at collect at /home/hadoop/iface_extractions/select_fields.py:90)
17/05/25 14:42:11 INFO YarnScheduler: Adding task set 0.0 with 1 tasks
17/05/25 14:42:11 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, ip-10-185-53-135.eu-west-1.compute.internal, executor 2, partition 0, PROCESS_LOCAL, 5899 bytes)
17/05/25 14:42:11 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on ip-10-185-53-135.eu-west-1.compute.internal:41552 (size: 4.1 KB, free: 4.0 GB)
17/05/25 14:42:12 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1101 ms on ip-10-185-53-135.eu-west-1.compute.internal (executor 2) (1/1)
17/05/25 14:42:12 INFO YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool
17/05/25 14:42:12 INFO DAGScheduler: ResultStage 0 (collect at /home/hadoop/iface_extractions/select_fields.py:90) finished in 1.109 s
17/05/25 14:42:12 INFO DAGScheduler: Job 0 finished: collect at /home/hadoop/iface_extractions/select_fields.py:90, took 1.290037 s
17/05/25 14:42:12 INFO BlockManagerInfo: Removed broadcast_0_piece0 on 10.185.53.161:37860 in memory (size: 4.1 KB, free: 4.0 GB)
17/05/25 14:42:12 INFO SparkContext: Starting job: collect at /home/hadoop/iface_extractions/select_fields.py:91
17/05/25 14:42:12 INFO BlockManagerInfo: Removed broadcast_0_piece0 on ip-10-185-53-135.eu-west-1.compute.internal:41552 in memory (size: 4.1 KB, free: 4.0 GB)
17/05/25 14:42:12 INFO DAGScheduler: Got job 1 (collect at /home/hadoop/iface_extractions/select_fields.py:91) with 1 output partitions
17/05/25 14:42:12 INFO DAGScheduler: Final stage: ResultStage 1 (collect at /home/hadoop/iface_extractions/select_fields.py:91)
17/05/25 14:42:12 INFO DAGScheduler: Parents of final stage: List()
17/05/25 14:42:12 INFO DAGScheduler: Missing parents: List()
17/05/25 14:42:12 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[5] at collect at /home/hadoop/iface_extractions/select_fields.py:91), which has no missing parents
17/05/25 14:42:12 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 7.5 KB, free 4.0 GB)
17/05/25 14:42:12 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 4.1 KB, free 4.0 GB)
17/05/25 14:42:12 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 10.185.53.161:37860 (size: 4.1 KB, free: 4.0 GB)
17/05/25 14:42:12 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:996
17/05/25 14:42:12 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (MapPartitionsRDD[5] at collect at /home/hadoop/iface_extractions/select_fields.py:91)
17/05/25 14:42:12 INFO YarnScheduler: Adding task set 1.0 with 1 tasks
17/05/25 14:42:12 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, ip-10-185-53-68.eu-west-1.compute.internal, executor 4, partition 0, PROCESS_LOCAL, 5900 bytes)
17/05/25 14:42:13 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on ip-10-185-53-68.eu-west-1.compute.internal:34222 (size: 4.1 KB, free: 4.0 GB)
17/05/25 14:42:14 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 1047 ms on ip-10-185-53-68.eu-west-1.compute.internal (executor 4) (1/1)
17/05/25 14:42:14 INFO YarnScheduler: Removed TaskSet 1.0, whose tasks have all completed, from pool
17/05/25 14:42:14 INFO DAGScheduler: ResultStage 1 (collect at /home/hadoop/iface_extractions/select_fields.py:91) finished in 1.047 s
17/05/25 14:42:14 INFO DAGScheduler: Job 1 finished: collect at /home/hadoop/iface_extractions/select_fields.py:91, took 1.054768 s
17/05/25 14:42:14 INFO CodeGenerator: Code generated in 13.109425 ms
17/05/25 14:42:14 INFO CodeGenerator: Code generated in 12.568665 ms
17/05/25 14:42:14 INFO CodeGenerator: Code generated in 11.257538 ms
17/05/25 14:42:14 INFO BlockManagerInfo: Removed broadcast_1_piece0 on 10.185.53.161:37860 in memory (size: 4.1 KB, free: 4.0 GB)
17/05/25 14:42:14 INFO BlockManagerInfo: Removed broadcast_1_piece0 on ip-10-185-53-68.eu-west-1.compute.internal:34222 in memory (size: 4.1 KB, free: 4.0 GB)
17/05/25 14:42:14 INFO CodeGenerator: Code generated in 11.563958 ms
17/05/25 14:42:14 INFO CodeGenerator: Code generated in 18.189301 ms
17/05/25 14:42:14 INFO CodeGenerator: Code generated in 13.490762 ms
17/05/25 14:42:14 INFO CodeGenerator: Code generated in 15.156166 ms
17/05/25 14:42:50 INFO ExecutorAllocationManager: Request to remove executorIds: 5, 3
17/05/25 14:42:50 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 5, 3
17/05/25 14:42:50 INFO YarnClientSchedulerBackend: Actual list of executor(s) to be killed is 5, 3
17/05/25 14:42:50 INFO ExecutorAllocationManager: Removing executor 5 because it has been idle for 60 seconds (new desired total will be 4)
17/05/25 14:42:50 INFO ExecutorAllocationManager: Removing executor 3 because it has been idle for 60 seconds (new desired total will be 3)
17/05/25 14:42:50 INFO ExecutorAllocationManager: Request to remove executorIds: 1
17/05/25 14:42:50 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 1
17/05/25 14:42:50 INFO YarnClientSchedulerBackend: Actual list of executor(s) to be killed is 1
17/05/25 14:42:50 INFO ExecutorAllocationManager: Removing executor 1 because it has been idle for 60 seconds (new desired total will be 2)
17/05/25 14:42:50 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 5.
17/05/25 14:42:50 INFO DAGScheduler: Executor lost: 5 (epoch 0)
17/05/25 14:42:50 INFO BlockManagerMasterEndpoint: Trying to remove executor 5 from BlockManagerMaster.
17/05/25 14:42:50 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(5, ip-10-185-52-31.eu-west-1.compute.internal, 38781, None)
17/05/25 14:42:50 INFO BlockManagerMaster: Removed 5 successfully in removeExecutor
17/05/25 14:42:50 INFO YarnScheduler: Executor 5 on ip-10-185-52-31.eu-west-1.compute.internal killed by driver.
17/05/25 14:42:50 INFO ExecutorAllocationManager: Existing executor 5 has been removed (new total is 4)
17/05/25 14:42:51 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 1.
17/05/25 14:42:51 INFO DAGScheduler: Executor lost: 1 (epoch 0)
17/05/25 14:42:51 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
17/05/25 14:42:51 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(1, ip-10-185-53-10.eu-west-1.compute.internal, 33391, None)
17/05/25 14:42:51 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor
17/05/25 14:42:51 INFO YarnScheduler: Executor 1 on ip-10-185-53-10.eu-west-1.compute.internal killed by driver.
17/05/25 14:42:51 INFO ExecutorAllocationManager: Existing executor 1 has been removed (new total is 3)
17/05/25 14:42:51 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 3.
17/05/25 14:42:51 INFO DAGScheduler: Executor lost: 3 (epoch 0)
17/05/25 14:42:51 INFO BlockManagerMasterEndpoint: Trying to remove executor 3 from BlockManagerMaster.
17/05/25 14:42:51 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(3, ip-10-185-53-45.eu-west-1.compute.internal, 43702, None)
17/05/25 14:42:51 INFO BlockManagerMaster: Removed 3 successfully in removeExecutor
17/05/25 14:42:51 INFO YarnScheduler: Executor 3 on ip-10-185-53-45.eu-west-1.compute.internal killed by driver.
17/05/25 14:42:51 INFO ExecutorAllocationManager: Existing executor 3 has been removed (new total is 2)
17/05/25 14:43:12 INFO ExecutorAllocationManager: Request to remove executorIds: 2
17/05/25 14:43:12 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 2
17/05/25 14:43:12 INFO YarnClientSchedulerBackend: Actual list of executor(s) to be killed is 2
17/05/25 14:43:12 INFO ExecutorAllocationManager: Removing executor 2 because it has been idle for 60 seconds (new desired total will be 1)
17/05/25 14:43:13 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 2.
17/05/25 14:43:13 INFO DAGScheduler: Executor lost: 2 (epoch 0)
17/05/25 14:43:13 INFO BlockManagerMasterEndpoint: Trying to remove executor 2 from BlockManagerMaster.
17/05/25 14:43:13 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(2, ip-10-185-53-135.eu-west-1.compute.internal, 41552, None)
17/05/25 14:43:13 INFO BlockManagerMaster: Removed 2 successfully in removeExecutor
17/05/25 14:43:13 INFO YarnScheduler: Executor 2 on ip-10-185-53-135.eu-west-1.compute.internal killed by driver.
17/05/25 14:43:13 INFO ExecutorAllocationManager: Existing executor 2 has been removed (new total is 1)
17/05/25 14:43:14 INFO ExecutorAllocationManager: Request to remove executorIds: 4
17/05/25 14:43:14 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 4
17/05/25 14:43:14 INFO YarnClientSchedulerBackend: Actual list of executor(s) to be killed is 4
17/05/25 14:43:14 INFO ExecutorAllocationManager: Removing executor 4 because it has been idle for 60 seconds (new desired total will be 0)
17/05/25 14:43:17 INFO YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 4.
17/05/25 14:43:17 INFO DAGScheduler: Executor lost: 4 (epoch 0)
17/05/25 14:43:17 INFO BlockManagerMasterEndpoint: Trying to remove executor 4 from BlockManagerMaster.
17/05/25 14:43:17 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(4, ip-10-185-53-68.eu-west-1.compute.internal, 34222, None)
17/05/25 14:43:17 INFO BlockManagerMaster: Removed 4 successfully in removeExecutor
17/05/25 14:43:17 INFO YarnScheduler: Executor 4 on ip-10-185-53-68.eu-west-1.compute.internal killed by driver.
17/05/25 14:43:17 INFO ExecutorAllocationManager: Existing executor 4 has been removed (new total is 0)
My guess is that you've got Dynamic Resource Allocation enabled in your Spark configuration.
Spark provides a mechanism to dynamically adjust the resources your application occupies based on the workload. This means that your application may give resources back to the cluster if they are no longer used and request them again later when there is demand. This feature is particularly useful if multiple applications share resources in your Spark cluster.
This feature is disabled by default and available on all coarse-grained cluster managers, i.e. standalone mode, YARN mode, and Mesos coarse-grained mode.
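Note the part of the quote saying the feature is disabled by default. Since your log shows ExecutorAllocationManager removing idle executors, I can only guess that it was enabled explicitly somewhere in your configuration.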
From ExecutorAllocationManager:
An agent that dynamically allocates and removes executors based on the workload.
With that said, I'd open the web UI (Environment tab) and check whether the spark.dynamicAllocation.enabled property is set to true.
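If the web UI is not handy, the same check can be done against the cluster's spark-defaults.conf. The following is only a minimal sketch: the helper names and the sample file contents are made up for illustration, and the parser handles just the simple "key value" and "key=value" forms.

```python
import re

def read_spark_defaults(text):
    """Parse spark-defaults.conf-style text into a dict of property -> value."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # Key and value are separated by '=' or by a run of whitespace.
        parts = re.split(r"\s*=\s*|\s+", line, maxsplit=1)
        if len(parts) == 2:
            props[parts[0]] = parts[1].strip()
    return props

def dynamic_allocation_enabled(props):
    # The documentation quoted above says the feature is disabled by default,
    # so a missing property is treated as "false".
    return props.get("spark.dynamicAllocation.enabled", "false").lower() == "true"

# Hypothetical file contents, for illustration only.
sample = """
# spark-defaults.conf
spark.dynamicAllocation.enabled  true
spark.shuffle.service.enabled=true
"""
props = read_spark_defaults(sample)
print(dynamic_allocation_enabled(props))  # True
```

Keep in mind that values passed with --conf or set in the application code take precedence over this file, so the web UI remains the more reliable view of what the running application actually uses.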
There are two requirements for using this feature (Dynamic Resource Allocation). First, your application must set spark.dynamicAllocation.enabled to true. Second, you must set up an external shuffle service on each worker node in the same cluster and set spark.shuffle.service.enabled to true in your application.
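As a rough illustration of the two application-side settings, here is a small sketch that turns a dict of properties into spark-submit arguments. The helper conf_args is hypothetical, and the script name is simply the one that appears in the log above. Each property has to be passed as a single key=value token.

```python
def conf_args(settings):
    """Turn a dict of Spark properties into spark-submit --conf arguments."""
    args = []
    for key, value in settings.items():
        # spark-submit expects each property as one key=value token.
        args += ["--conf", f"{key}={value}"]
    return args

dynamic_allocation = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.shuffle.service.enabled": "true",
}

cmd = ["spark-submit"] + conf_args(dynamic_allocation) + ["select_fields.py"]
print(" ".join(cmd))
```

Setting spark.dynamicAllocation.enabled to false in the same way should stop the driver from releasing idle executors, if that is the behaviour you want. The external shuffle service itself still has to be set up on the worker nodes separately; these flags only cover the application side.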
This is the line that prints out the INFO message:
logInfo("Request to remove executorIds: " + executors.mkString(", "))
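To confirm from a driver log that the removals really are idle-timeout removals, one could scan for the ExecutorAllocationManager messages. A minimal sketch, assuming the message format seen in the log above (the function name is made up, and the format may differ between Spark versions):

```python
import re

# Matches the idle-removal message printed by ExecutorAllocationManager
# in the log from the question.
IDLE_REMOVAL = re.compile(
    r"ExecutorAllocationManager: Removing executor (\S+) because it has been "
    r"idle for (\d+) seconds"
)

def idle_removals(log_lines):
    """Return (executor_id, idle_seconds) pairs for each idle-timeout removal."""
    found = []
    for line in log_lines:
        match = IDLE_REMOVAL.search(line)
        if match:
            found.append((match.group(1), int(match.group(2))))
    return found

# Three lines copied from the log in the question.
log = [
    "17/05/25 14:42:50 INFO ExecutorAllocationManager: Request to remove executorIds: 5, 3",
    "17/05/25 14:42:50 INFO ExecutorAllocationManager: Removing executor 5 because it has been idle for 60 seconds (new desired total will be 4)",
    "17/05/25 14:42:50 INFO ExecutorAllocationManager: Removing executor 3 because it has been idle for 60 seconds (new desired total will be 3)",
]
print(idle_removals(log))  # [('5', 60), ('3', 60)]
```

The 60 seconds in these messages matches the default of spark.dynamicAllocation.executorIdleTimeout, which is another hint that dynamic allocation is doing the removals.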
You can also kill executors with SparkContext.killExecutors, which gives a Spark developer a way to kill executors themselves.
killExecutors(executorIds: Seq[String]): Boolean Request that the cluster manager kill the specified executors.
There are actually two variants (killExecutors and killExecutor), and they are very helpful for demo purposes, as you can easily show how executors come and go.

Resources