Spark with GPUs: How to force 1 task per executor - apache-spark

I have Spark 2.1.0 running on a cluster with N slave nodes. Each node has 16 cores (2 CPUs with 8 cores each) and 1 GPU. I want to launch a GPU kernel from inside a map operation. Since there is only 1 GPU per node, I need to ensure that two executors are never on the same node trying to use the GPU at the same time, and that two tasks are never submitted to the same executor at the same time.
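For reference, the per-task GPU call I have in mind looks roughly like this (only a sketch: run_gpu_kernel is a stand-in for the real CUDA launch, and I use mapPartitions so each task makes exactly one GPU call):

from pyspark import SparkContext

def run_gpu_kernel(records):
    # Stand-in for the real GPU work: copy the records to the device,
    # launch the kernel, and return the results.
    return list(records)

sc = SparkContext(appName="gpu-per-node")
rdd = sc.parallelize(range(6), 6)                     # goal: one partition per node
result = rdd.mapPartitions(run_gpu_kernel).collect()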
How can I force Spark to have one executor per node?
I have tried the following:
--Setting: spark.executor.cores 16 in $SPARK_HOME/conf/spark-defaults.conf
--Setting: SPARK_WORKER_CORES = 16 and SPARK_WORKER_INSTANCES = 1 in $SPARK_HOME/conf/spark-env.sh
and,
--Setting conf = SparkConf().set('spark.executor.cores', 16).set('spark.executor.instances', 6) directly in my spark script (when I wanted N=6 for debugging purposes).
These options create 6 executors on different nodes, as desired, but then every task is assigned to the same executor.
Here are some snippets from my most recent output (which led me to believe it should be working as I want).
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20170217110910-0000/0 on worker-20170217110853-10.128.14.208-35771 (10.128.14.208:35771) with 16 cores
17/02/17 11:09:10 INFO StandaloneSchedulerBackend: Granted executor ID app-20170217110910-0000/0 on hostPort 10.128.14.208:35771 with 16 cores, 16.0 GB RAM
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20170217110910-0000/1 on worker-20170217110853-10.128.9.95-59294 (10.128.9.95:59294) with 16 cores
17/02/17 11:09:10 INFO StandaloneSchedulerBackend: Granted executor ID app-20170217110910-0000/1 on hostPort 10.128.9.95:59294 with 16 cores, 16.0 GB RAM
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20170217110910-0000/2 on worker-20170217110853-10.128.3.71-47507 (10.128.3.71:47507) with 16 cores
17/02/17 11:09:10 INFO StandaloneSchedulerBackend: Granted executor ID app-20170217110910-0000/2 on hostPort 10.128.3.71:47507 with 16 cores, 16.0 GB RAM
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20170217110910-0000/3 on worker-20170217110853-10.128.9.96-50800 (10.128.9.96:50800) with 16 cores
17/02/17 11:09:10 INFO StandaloneSchedulerBackend: Granted executor ID app-20170217110910-0000/3 on hostPort 10.128.9.96:50800 with 16 cores, 16.0 GB RAM
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20170217110910-0000/4 on worker-20170217110853-10.128.3.73-60194 (10.128.3.73:60194) with 16 cores
17/02/17 11:09:10 INFO StandaloneSchedulerBackend: Granted executor ID app-20170217110910-0000/4 on hostPort 10.128.3.73:60194 with 16 cores, 16.0 GB RAM
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20170217110910-0000/5 on worker-20170217110853-10.128.3.74-42793 (10.128.3.74:42793) with 16 cores
17/02/17 11:09:10 INFO StandaloneSchedulerBackend: Granted executor ID app-20170217110910-0000/5 on hostPort 10.128.3.74:42793 with 16 cores, 16.0 GB RAM
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20170217110910-0000/1 is now RUNNING
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20170217110910-0000/3 is now RUNNING
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20170217110910-0000/4 is now RUNNING
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20170217110910-0000/2 is now RUNNING
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20170217110910-0000/0 is now RUNNING
17/02/17 11:09:10 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20170217110910-0000/5 is now RUNNING
17/02/17 11:09:11 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
My RDD has 6 partitions.
The important thing is that 6 executors were started, each with a different IP address and each getting 16 cores (exactly what I expected). The line My RDD has 6 partitions. is a print statement from my code after repartitioning my RDD (to make sure I had 1 partition per executor).
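That part of the script is roughly (sc as in the sketch above):

rdd = sc.parallelize(range(6)).repartition(6)   # force exactly 6 partitions, one per executor
print("My RDD has {} partitions.".format(rdd.getNumPartitions()))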
Then, THIS happens... all 6 tasks are sent to the same executor!
17/02/17 11:09:12 INFO TaskSchedulerImpl: Adding task set 0.0 with 6 tasks
17/02/17 11:09:17 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(null) (10.128.9.95:34059) with ID 1
17/02/17 11:09:17 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, 10.128.9.95, executor 1, partition 0, PROCESS_LOCAL, 6095 bytes)
17/02/17 11:09:17 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, 10.128.9.95, executor 1, partition 1, PROCESS_LOCAL, 6095 bytes)
17/02/17 11:09:17 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, 10.128.9.95, executor 1, partition 2, PROCESS_LOCAL, 6095 bytes)
17/02/17 11:09:17 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, 10.128.9.95, executor 1, partition 3, PROCESS_LOCAL, 6095 bytes)
17/02/17 11:09:17 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4, 10.128.9.95, executor 1, partition 4, PROCESS_LOCAL, 6095 bytes)
17/02/17 11:09:17 INFO TaskSetManager: Starting task 5.0 in stage 0.0 (TID 5, 10.128.9.95, executor 1, partition 5, PROCESS_LOCAL, 6095 bytes)
Why? And how can I fix it? The problem is that, at this point, all 6 tasks compete for the same GPU, and the GPU cannot be shared.

I tried the suggestions from Samson Scharfrichter's comments, but they didn't seem to work. However, I found http://spark.apache.org/docs/latest/configuration.html#scheduling, which includes spark.task.cpus. If I set that to 16 and spark.executor.cores to 16, then I appear to get one task assigned to each executor.
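In other words, the combination that appears to work looks roughly like this in my PySpark setup (sketch for the N=6 debug run):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.executor.cores", "16")       # each executor claims all 16 cores on its node
        .set("spark.task.cpus", "16")            # each task also requires 16 cores...
        .set("spark.executor.instances", "6"))   # ...so only one task fits per executor
sc = SparkContext(conf=conf)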

Related

Can the driver program and cluster manager (resource manager) be on the same machine in Spark standalone?

I'm running spark-submit from the same machine as the Spark master, using the following command: ./bin/spark-submit --master spark://ip:port --deploy-mode "client" test.py. My application runs forever with the following kind of output:
22/11/18 13:17:37 INFO BlockManagerMaster: Removal of executor 8 requested
22/11/18 13:17:37 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 8
22/11/18 13:17:37 INFO StandaloneSchedulerBackend: Granted executor ID app-20221118131723-0008/10 on hostPort 192.168.210.94:37443 with 2 core(s), 1024.0 MiB RAM
22/11/18 13:17:37 INFO BlockManagerMasterEndpoint: Trying to remove executor 8 from BlockManagerMaster.
22/11/18 13:17:37 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20221118131723-0008/10 is now RUNNING
22/11/18 13:17:38 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20221118131723-0008/9 is now EXITED (Command exited with code 1)
22/11/18 13:17:38 INFO StandaloneSchedulerBackend: Executor app-20221118131723-0008/9 removed: Command exited with code 1
22/11/18 13:17:38 INFO BlockManagerMaster: Removal of executor 9 requested
22/11/18 13:17:38 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 9
22/11/18 13:17:38 INFO BlockManagerMasterEndpoint: Trying to remove executor 9 from BlockManagerMaster.
22/11/18 13:17:38 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20221118131723-0008/11 on worker-20221118111836-192.168.210.82-46395 (192.168.210.82:4639
But when I run it from other nodes, my application runs successfully. What could be the reason?

Spark Streaming integration with Kinesis not receiving records in EMR

I'm trying to run the word count example described here, but the DStream reading from the Kinesis stream is always empty.
This is how I'm running it:
Launched an AWS EMR cluster in version 6.5.0 (Running spark 3.1.2)
SSHed into the master instance
ran: spark-example --packages org.apache.spark:spark-streaming-kinesis-asl_2.12:3.1.2 streaming.JavaKinesisWordCountASL streaming_test streaming_test https://kinesis.sa-east-1.amazonaws.com
In another tab, ran: spark-example --packages org.apache.spark:spark-streaming-kinesis-asl_2.12:3.1.2 streaming.KinesisWordProducerASL streaming-test https://kinesis.sa-east-1.amazonaws.com 100 10
Additional info:
EMR cluster with 2 m5.xlarge instances
Kinesis with a single shard only
I can fetch records from the stream using boto3 (rough snippet below)
A DynamoDB table was indeed created for storing checkpoints, but nothing was written on it
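For reference, the boto3 check mentioned above is roughly the following (stream name and region assumed to match the commands above):

import boto3

kinesis = boto3.client("kinesis", region_name="sa-east-1")
shard_id = kinesis.describe_stream(StreamName="streaming_test")["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName="streaming_test",
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]
print(kinesis.get_records(ShardIterator=iterator)["Records"])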
Logs (This is just a sample - after it finishes initializing, it keeps repeating that pattern of pprint with no records, followed by a bunch of spark related logs, then followed again by another pprint with no records)
22/01/27 21:39:46 INFO TaskSetManager: Starting task 1.0 in stage 8.0 (TID 77) (ip-10-0-13-187.sa-east-1.compute.internal, executor 1, partition 6, PROCESS_LOCAL, 4443 bytes) taskResourceAssignments Map()
22/01/27 21:39:46 INFO TaskSetManager: Finished task 0.0 in stage 8.0 (TID 76) in 19 ms on ip-10-0-13-187.sa-east-1.compute.internal (executor 1) (1/3)
22/01/27 21:39:46 INFO TaskSetManager: Starting task 2.0 in stage 8.0 (TID 78) (ip-10-0-13-187.sa-east-1.compute.internal, executor 1, partition 7, PROCESS_LOCAL, 4443 bytes) taskResourceAssignments Map()
22/01/27 21:39:46 INFO TaskSetManager: Finished task 1.0 in stage 8.0 (TID 77) in 10 ms on ip-10-0-13-187.sa-east-1.compute.internal (executor 1) (2/3)
22/01/27 21:39:46 INFO TaskSetManager: Finished task 2.0 in stage 8.0 (TID 78) in 8 ms on ip-10-0-13-187.sa-east-1.compute.internal (executor 1) (3/3)
22/01/27 21:39:46 INFO YarnScheduler: Removed TaskSet 8.0, whose tasks have all completed, from pool
22/01/27 21:39:46 INFO DAGScheduler: ResultStage 8 (print at JavaKinesisWordCountASL.java:190) finished in 0,042 s
22/01/27 21:39:46 INFO DAGScheduler: Job 4 is finished. Cancelling potential speculative or zombie tasks for this job
22/01/27 21:39:46 INFO YarnScheduler: Killing all running tasks in stage 8: Stage finished
22/01/27 21:39:46 INFO DAGScheduler: Job 4 finished: print at JavaKinesisWordCountASL.java:190, took 0,048372 s
-------------------------------------------
Time: 1643319586000 ms
-------------------------------------------
22/01/27 21:39:46 INFO JobScheduler: Finished job streaming job 1643319586000 ms.0 from job set of time 1643319586000 ms
22/01/27 21:39:46 INFO JobScheduler: Total delay: 0,271 s for time 1643319586000 ms (execution: 0,227 s)
22/01/27 21:39:46 INFO ReceivedBlockTracker: Deleting batches:
Also, apparently, the Library does manage to connect to the Kinesis stream:
22/01/27 21:39:44 INFO KinesisInputDStream: Slide time = 2000 ms
22/01/27 21:39:44 INFO KinesisInputDStream: Storage level = Serialized 1x Replicated
22/01/27 21:39:44 INFO KinesisInputDStream: Checkpoint interval = null
22/01/27 21:39:44 INFO KinesisInputDStream: Remember interval = 2000 ms
22/01/27 21:39:44 INFO KinesisInputDStream: Initialized and validated org.apache.spark.streaming.kinesis.KinesisInputDStream#7cc3580b
Help would be very much appreciated!

Spark: Jobs are not assigned

I've deployed a Spark cluster on Kubernetes. Here is the web UI:
I'm trying to submit the SparkPi example using:
$ ./spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://spark-cluster-ra-iot-dev.si-origin-cluster.t-systems.es:32316 \
--num-executors 1 \
--driver-memory 512m \
--executor-memory 512m \
--executor-cores 1 \
../examples/jars/spark-examples_2.11-2.4.5.jar 10
The job is received by the Spark cluster:
Nevertheless, I'm getting messages like:
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
It seems like the SparkPi application is scheduled but never executed...
Here is the complete log:
./spark-submit --class org.apache.spark.examples.SparkPi --master spark://spark-cluster-ra-iot-dev.si-origin-cluster.t-systems.es:32316 --num-executors 1 --driver-memory 512m --executor-memory 512m --executor-cores 1 ../examples/jars/spark-examples_2.11-2.4.5.jar 10
20/06/09 10:52:57 WARN Utils: Your hostname, psgd resolves to a loopback address: 127.0.1.1; using 10.0.2.15 instead (on interface enp0s3)
20/06/09 10:52:57 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
20/06/09 10:52:57 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/06/09 10:52:58 INFO SparkContext: Running Spark version 2.4.5
20/06/09 10:52:58 INFO SparkContext: Submitted application: Spark Pi
20/06/09 10:52:58 INFO SecurityManager: Changing view acls to: jeusdi
20/06/09 10:52:58 INFO SecurityManager: Changing modify acls to: jeusdi
20/06/09 10:52:58 INFO SecurityManager: Changing view acls groups to:
20/06/09 10:52:58 INFO SecurityManager: Changing modify acls groups to:
20/06/09 10:52:58 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(jeusdi); groups with view permissions: Set(); users with modify permissions: Set(jeusdi); groups with modify permissions: Set()
20/06/09 10:52:59 INFO Utils: Successfully started service 'sparkDriver' on port 42943.
20/06/09 10:52:59 INFO SparkEnv: Registering MapOutputTracker
20/06/09 10:52:59 INFO SparkEnv: Registering BlockManagerMaster
20/06/09 10:52:59 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
20/06/09 10:52:59 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
20/06/09 10:52:59 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-b6c54054-c94b-42c7-b85f-a4e30be4b659
20/06/09 10:52:59 INFO MemoryStore: MemoryStore started with capacity 117.0 MB
20/06/09 10:52:59 INFO SparkEnv: Registering OutputCommitCoordinator
20/06/09 10:52:59 INFO Utils: Successfully started service 'SparkUI' on port 4040.
20/06/09 10:53:00 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.0.2.15:4040
20/06/09 10:53:00 INFO SparkContext: Added JAR file:/home/jeusdi/projects/workarea/valladolid/spark-2.4.5-bin-hadoop2.7/bin/../examples/jars/spark-examples_2.11-2.4.5.jar at spark://10.0.2.15:42943/jars/spark-examples_2.11-2.4.5.jar with timestamp 1591692780146
20/06/09 10:53:00 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://spark-cluster-ra-iot-dev.si-origin-cluster.t-systems.es:32316...
20/06/09 10:53:00 INFO TransportClientFactory: Successfully created connection to spark-cluster-ra-iot-dev.si-origin-cluster.t-systems.es/10.49.160.69:32316 after 152 ms (0 ms spent in bootstraps)
20/06/09 10:53:01 INFO StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20200609085300-0002
20/06/09 10:53:01 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20200609085300-0002/0 on worker-20200609084543-10.129.3.127-45867 (10.129.3.127:45867) with 1 core(s)
20/06/09 10:53:01 INFO StandaloneSchedulerBackend: Granted executor ID app-20200609085300-0002/0 on hostPort 10.129.3.127:45867 with 1 core(s), 512.0 MB RAM
20/06/09 10:53:01 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20200609085300-0002/1 on worker-20200609084543-10.129.3.127-45867 (10.129.3.127:45867) with 1 core(s)
20/06/09 10:53:01 INFO StandaloneSchedulerBackend: Granted executor ID app-20200609085300-0002/1 on hostPort 10.129.3.127:45867 with 1 core(s), 512.0 MB RAM
20/06/09 10:53:01 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20200609085300-0002/2 on worker-20200609084543-10.129.3.127-45867 (10.129.3.127:45867) with 1 core(s)
20/06/09 10:53:01 INFO StandaloneSchedulerBackend: Granted executor ID app-20200609085300-0002/2 on hostPort 10.129.3.127:45867 with 1 core(s), 512.0 MB RAM
20/06/09 10:53:01 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20200609085300-0002/3 on worker-20200609084543-10.129.3.127-45867 (10.129.3.127:45867) with 1 core(s)
20/06/09 10:53:01 INFO StandaloneSchedulerBackend: Granted executor ID app-20200609085300-0002/3 on hostPort 10.129.3.127:45867 with 1 core(s), 512.0 MB RAM
20/06/09 10:53:01 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20200609085300-0002/4 on worker-20200609084543-10.129.3.127-45867 (10.129.3.127:45867) with 1 core(s)
20/06/09 10:53:01 INFO StandaloneSchedulerBackend: Granted executor ID app-20200609085300-0002/4 on hostPort 10.129.3.127:45867 with 1 core(s), 512.0 MB RAM
20/06/09 10:53:01 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 33755.
20/06/09 10:53:01 INFO NettyBlockTransferService: Server created on 10.0.2.15:33755
20/06/09 10:53:01 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20200609085300-0002/5 on worker-20200609084509-10.128.3.197-41600 (10.128.3.197:41600) with 1 core(s)
20/06/09 10:53:01 INFO StandaloneSchedulerBackend: Granted executor ID app-20200609085300-0002/5 on hostPort 10.128.3.197:41600 with 1 core(s), 512.0 MB RAM
20/06/09 10:53:01 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20200609085300-0002/6 on worker-20200609084509-10.128.3.197-41600 (10.128.3.197:41600) with 1 core(s)
20/06/09 10:53:01 INFO StandaloneSchedulerBackend: Granted executor ID app-20200609085300-0002/6 on hostPort 10.128.3.197:41600 with 1 core(s), 512.0 MB RAM
20/06/09 10:53:01 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20200609085300-0002/7 on worker-20200609084509-10.128.3.197-41600 (10.128.3.197:41600) with 1 core(s)
20/06/09 10:53:01 INFO StandaloneSchedulerBackend: Granted executor ID app-20200609085300-0002/7 on hostPort 10.128.3.197:41600 with 1 core(s), 512.0 MB RAM
20/06/09 10:53:01 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/06/09 10:53:01 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20200609085300-0002/8 on worker-20200609084509-10.128.3.197-41600 (10.128.3.197:41600) with 1 core(s)
20/06/09 10:53:01 INFO StandaloneSchedulerBackend: Granted executor ID app-20200609085300-0002/8 on hostPort 10.128.3.197:41600 with 1 core(s), 512.0 MB RAM
20/06/09 10:53:01 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20200609085300-0002/9 on worker-20200609084509-10.128.3.197-41600 (10.128.3.197:41600) with 1 core(s)
20/06/09 10:53:01 INFO StandaloneSchedulerBackend: Granted executor ID app-20200609085300-0002/9 on hostPort 10.128.3.197:41600 with 1 core(s), 512.0 MB RAM
20/06/09 10:53:01 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20200609085300-0002/10 on worker-20200609084426-10.131.1.27-46041 (10.131.1.27:46041) with 1 core(s)
20/06/09 10:53:01 INFO StandaloneSchedulerBackend: Granted executor ID app-20200609085300-0002/10 on hostPort 10.131.1.27:46041 with 1 core(s), 512.0 MB RAM
20/06/09 10:53:01 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20200609085300-0002/11 on worker-20200609084426-10.131.1.27-46041 (10.131.1.27:46041) with 1 core(s)
20/06/09 10:53:01 INFO StandaloneSchedulerBackend: Granted executor ID app-20200609085300-0002/11 on hostPort 10.131.1.27:46041 with 1 core(s), 512.0 MB RAM
20/06/09 10:53:01 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20200609085300-0002/12 on worker-20200609084426-10.131.1.27-46041 (10.131.1.27:46041) with 1 core(s)
20/06/09 10:53:01 INFO StandaloneSchedulerBackend: Granted executor ID app-20200609085300-0002/12 on hostPort 10.131.1.27:46041 with 1 core(s), 512.0 MB RAM
20/06/09 10:53:01 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20200609085300-0002/13 on worker-20200609084426-10.131.1.27-46041 (10.131.1.27:46041) with 1 core(s)
20/06/09 10:53:01 INFO StandaloneSchedulerBackend: Granted executor ID app-20200609085300-0002/13 on hostPort 10.131.1.27:46041 with 1 core(s), 512.0 MB RAM
20/06/09 10:53:01 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20200609085300-0002/14 on worker-20200609084426-10.131.1.27-46041 (10.131.1.27:46041) with 1 core(s)
20/06/09 10:53:01 INFO StandaloneSchedulerBackend: Granted executor ID app-20200609085300-0002/14 on hostPort 10.131.1.27:46041 with 1 core(s), 512.0 MB RAM
20/06/09 10:53:01 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200609085300-0002/5 is now RUNNING
20/06/09 10:53:01 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200609085300-0002/6 is now RUNNING
20/06/09 10:53:01 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200609085300-0002/7 is now RUNNING
20/06/09 10:53:01 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200609085300-0002/8 is now RUNNING
20/06/09 10:53:01 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.0.2.15, 33755, None)
20/06/09 10:53:01 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200609085300-0002/9 is now RUNNING
20/06/09 10:53:01 INFO BlockManagerMasterEndpoint: Registering block manager 10.0.2.15:33755 with 117.0 MB RAM, BlockManagerId(driver, 10.0.2.15, 33755, None)
20/06/09 10:53:01 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200609085300-0002/10 is now RUNNING
20/06/09 10:53:01 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200609085300-0002/11 is now RUNNING
20/06/09 10:53:01 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.0.2.15, 33755, None)
20/06/09 10:53:01 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.0.2.15, 33755, None)
20/06/09 10:53:01 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200609085300-0002/12 is now RUNNING
20/06/09 10:53:01 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200609085300-0002/13 is now RUNNING
20/06/09 10:53:01 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200609085300-0002/14 is now RUNNING
20/06/09 10:53:01 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200609085300-0002/0 is now RUNNING
20/06/09 10:53:01 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200609085300-0002/1 is now RUNNING
20/06/09 10:53:01 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200609085300-0002/2 is now RUNNING
20/06/09 10:53:01 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200609085300-0002/3 is now RUNNING
20/06/09 10:53:01 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200609085300-0002/4 is now RUNNING
20/06/09 10:53:01 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
20/06/09 10:53:02 INFO SparkContext: Starting job: reduce at SparkPi.scala:38
20/06/09 10:53:02 INFO DAGScheduler: Got job 0 (reduce at SparkPi.scala:38) with 10 output partitions
20/06/09 10:53:02 INFO DAGScheduler: Final stage: ResultStage 0 (reduce at SparkPi.scala:38)
20/06/09 10:53:02 INFO DAGScheduler: Parents of final stage: List()
20/06/09 10:53:02 INFO DAGScheduler: Missing parents: List()
20/06/09 10:53:02 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
20/06/09 10:53:03 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 2.0 KB, free 117.0 MB)
20/06/09 10:53:03 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1381.0 B, free 117.0 MB)
20/06/09 10:53:03 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.0.2.15:33755 (size: 1381.0 B, free: 117.0 MB)
20/06/09 10:53:03 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1163
20/06/09 10:53:03 INFO DAGScheduler: Submitting 10 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9))
20/06/09 10:53:03 INFO TaskSchedulerImpl: Adding task set 0.0 with 10 tasks
20/06/09 10:53:18 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
20/06/09 10:53:33 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
20/06/09 10:53:48 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
20/06/09 10:54:03 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
20/06/09 10:54:18 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
20/06/09 10:54:33 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
20/06/09 10:54:48 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
20/06/09 10:55:03 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200609085300-0002/13 is now EXITED (Command exited with code 1)
20/06/09 10:55:03 INFO StandaloneSchedulerBackend: Executor app-20200609085300-0002/13 removed: Command exited with code 1
20/06/09 10:55:03 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20200609085300-0002/15 on worker-20200609084426-10.131.1.27-46041 (10.131.1.27:46041) with 1 core(s)
20/06/09 10:55:03 INFO StandaloneSchedulerBackend: Granted executor ID app-20200609085300-0002/15 on hostPort 10.131.1.27:46041 with 1 core(s), 512.0 MB RAM
20/06/09 10:55:03 INFO BlockManagerMasterEndpoint: Trying to remove executor 13 from BlockManagerMaster.
20/06/09 10:55:03 INFO BlockManagerMaster: Removal of executor 13 requested
20/06/09 10:55:03 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 13
20/06/09 10:55:03 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200609085300-0002/15 is now RUNNING
20/06/09 10:55:03 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
20/06/09 10:55:03 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200609085300-0002/12 is now EXITED (Command exited with code 1)
20/06/09 10:55:03 INFO StandaloneSchedulerBackend: Executor app-20200609085300-0002/12 removed: Command exited with code 1
20/06/09 10:55:03 INFO BlockManagerMaster: Removal of executor 12 requested
20/06/09 10:55:03 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 12
20/06/09 10:55:03 INFO BlockManagerMasterEndpoint: Trying to remove executor 12 from BlockManagerMaster.
20/06/09 10:55:03 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20200609085300-0002/16 on worker-20200609084426-10.131.1.27-46041 (10.131.1.27:46041) with 1 core(s)
20/06/09 10:55:03 INFO StandaloneSchedulerBackend: Granted executor ID app-20200609085300-0002/16 on hostPort 10.131.1.27:46041 with 1 core(s), 512.0 MB RAM
20/06/09 10:55:03 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200609085300-0002/16 is now RUNNING
20/06/09 10:55:03 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200609085300-0002/14 is now EXITED (Command exited with code 1)
20/06/09 10:55:03 INFO StandaloneSchedulerBackend: Executor app-20200609085300-0002/14 removed: Command exited with code 1
20/06/09 10:55:03 INFO BlockManagerMaster: Removal of executor 14 requested
20/06/09 10:55:03 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 14
20/06/09 10:55:03 INFO BlockManagerMasterEndpoint: Trying to remove executor 14 from BlockManagerMaster.
20/06/09 10:55:03 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20200609085300-0002/17 on worker-20200609084426-10.131.1.27-46041 (10.131.1.27:46041) with 1 core(s)
20/06/09 10:55:03 INFO StandaloneSchedulerBackend: Granted executor ID app-20200609085300-0002/17 on hostPort 10.131.1.27:46041 with 1 core(s), 512.0 MB RAM
20/06/09 10:55:03 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200609085300-0002/17 is now RUNNING
20/06/09 10:55:03 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200609085300-0002/10 is now EXITED (Command exited with code 1)
20/06/09 10:55:03 INFO StandaloneSchedulerBackend: Executor app-20200609085300-0002/10 removed: Command exited with code 1
20/06/09 10:55:03 INFO BlockManagerMaster: Removal of executor 10 requested
20/06/09 10:55:03 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 10
20/06/09 10:55:03 INFO BlockManagerMasterEndpoint: Trying to remove executor 10 from BlockManagerMaster.
20/06/09 10:55:03 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20200609085300-0002/18 on worker-20200609084426-10.131.1.27-46041 (10.131.1.27:46041) with 1 core(s)
20/06/09 10:55:03 INFO StandaloneSchedulerBackend: Granted executor ID app-20200609085300-0002/18 on hostPort 10.131.1.27:46041 with 1 core(s), 512.0 MB RAM
20/06/09 10:55:03 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200609085300-0002/18 is now RUNNING
20/06/09 10:55:03 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200609085300-0002/8 is now EXITED (Command exited with code 1)
20/06/09 10:55:03 INFO StandaloneSchedulerBackend: Executor app-20200609085300-0002/8 removed: Command exited with code 1
20/06/09 10:55:03 INFO BlockManagerMaster: Removal of executor 8 requested
20/06/09 10:55:03 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 8
20/06/09 10:55:03 INFO BlockManagerMasterEndpoint: Trying to remove executor 8 from BlockManagerMaster.
20/06/09 10:55:03 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20200609085300-0002/19 on worker-20200609084509-10.128.3.197-41600 (10.128.3.197:41600) with 1 core(s)
20/06/09 10:55:03 INFO StandaloneSchedulerBackend: Granted executor ID app-20200609085300-0002/19 on hostPort 10.128.3.197:41600 with 1 core(s), 512.0 MB RAM
20/06/09 10:55:03 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200609085300-0002/11 is now EXITED (Command exited with code 1)
20/06/09 10:55:03 INFO StandaloneSchedulerBackend: Executor app-20200609085300-0002/11 removed: Command exited with code 1
20/06/09 10:55:03 INFO BlockManagerMasterEndpoint: Trying to remove executor 11 from BlockManagerMaster.
20/06/09 10:55:03 INFO BlockManagerMaster: Removal of executor 11 requested
20/06/09 10:55:03 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 11
20/06/09 10:55:03 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20200609085300-0002/20 on worker-20200609084426-10.131.1.27-46041 (10.131.1.27:46041) with 1 core(s)
20/06/09 10:55:03 INFO StandaloneSchedulerBackend: Granted executor ID app-20200609085300-0002/20 on hostPort 10.131.1.27:46041 with 1 core(s), 512.0 MB RAM
20/06/09 10:55:03 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200609085300-0002/19 is now RUNNING
20/06/09 10:55:03 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200609085300-0002/20 is now RUNNING
20/06/09 10:55:03 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200609085300-0002/7 is now EXITED (Command exited with code 1)
20/06/09 10:55:03 INFO StandaloneSchedulerBackend: Executor app-20200609085300-0002/7 removed: Command exited with code 1
20/06/09 10:55:03 INFO BlockManagerMaster: Removal of executor 7 requested
20/06/09 10:55:03 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 7
20/06/09 10:55:03 INFO BlockManagerMasterEndpoint: Trying to remove executor 7 from BlockManagerMaster.
20/06/09 10:55:03 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20200609085300-0002/21 on worker-20200609084509-10.128.3.197-41600 (10.128.3.197:41600) with 1 core(s)
20/06/09 10:55:03 INFO StandaloneSchedulerBackend: Granted executor ID app-20200609085300-0002/21 on hostPort 10.128.3.197:41600 with 1 core(s), 512.0 MB RAM
20/06/09 10:55:03 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200609085300-0002/21 is now RUNNING
20/06/09 10:55:03 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20200609085300-0002/9 is now EXITED (Command exited with code 1)
20/06/09 10:55:03 INFO StandaloneSchedulerBackend: Executor app-20200609085300-0002/9 removed: Command exited with code 1
20/06/09 10:55:03 INFO BlockManagerMaster: Removal of executor 9 requested
20/06/09 10:55:03 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 9
20/06/09 10:55:03 INFO BlockManagerMasterEndpoint: Trying to remove executor 9 from BlockManagerMaster.
20/06/09 10:55:03 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20200609085300-0002/22 on worker-20200609084509-10.128.3.197-41600 (10.128.3.197:41600) with 1 core(s)
...

Apache Spark driver logs don't specify reason of stage cancelling

I run Apache Spark on AWS EMR under YARN.
The cluster has 1 master and 10 executors.
After some hours of processing, my cluster failed, so I went to look at the logs.
I see that all working executors were trying to kill tasks at the same time (this is the log of one executor):
20/03/05 00:02:12 INFO Executor: Executor is trying to kill task 66.0 in stage 2.0 (TID 466), reason: Stage cancelled
20/03/05 00:02:12 INFO Executor: Executor is trying to kill task 65.0 in stage 2.0 (TID 465), reason: Stage cancelled
20/03/05 00:02:12 INFO Executor: Executor is trying to kill task 67.0 in stage 2.0 (TID 467), reason: Stage cancelled
20/03/05 00:02:12 INFO Executor: Executor is trying to kill task 64.0 in stage 2.0 (TID 464), reason: Stage cancelled
20/03/05 00:02:12 ERROR Utils: Aborting a task
I see that the reason is Stage cancelled, but I can't get any details about it. Looking at the driver logs, I find that their last record is from a much earlier time.
So I have 2 questions:
Why are the driver logs much shorter than the executor logs?
How can I get the real reason why the stage was cancelled?
20/03/04 18:39:40 INFO TaskSetManager: Starting task 159.0 in stage 1.0 (TID 359, ip-172-31-6-236.us-west-2.compute.internal, executor 40, partition 159, RACK_LOCAL, 8421 bytes)
20/03/04 18:39:40 INFO ExecutorAllocationManager: New executor 40 has registered (new total is 40)
20/03/04 18:39:41 INFO BlockManagerMasterEndpoint: Registering block manager ip-172-31-6-236.us-west-2.compute.internal:33589 with 2.8 GB RAM, BlockManagerId(40, ip-172-31-6-236.us-west-2.compute.internal, 33589, None)
20/03/04 18:39:42 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on ip-172-31-6-236.us-west-2.compute.internal:33589 (size: 44.7 KB, free: 2.8 GB)
20/03/04 18:39:48 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on ip-172-31-6-236.us-west-2.compute.internal:33589 (size: 37.4 KB, free: 2.8 GB)

Spark executor lost when increasing the number of executor instances

My Hadoop cluster currently has 4 nodes and 45 cores, running PySpark 2.4 through YARN. When I run spark-submit with one executor, everything works fine, but if I change the number of executor instances to 3 or 4, the extra executors are killed by the driver and only one task keeps running.
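For reference, my submission corresponds roughly to these settings (values assumed to match what I pass on the command line; the ExecutorAllocationManager messages in the log below show that dynamic allocation is on):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.executor.instances", "3")                         # what I pass as --num-executors
        .set("spark.dynamicAllocation.enabled", "true")               # implied by the ExecutorAllocationManager log lines
        .set("spark.dynamicAllocation.executorIdleTimeout", "60s"))   # matches the "idle for 60 seconds" messages
sc = SparkContext(conf=conf)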
I have changed the below settings on Cloudera manager:
yarn.nodemanager.resource.memory-mb : 64 GB
yarn.nodemanager.resource.cpu-vcores: 45
And below is the log that I get:
19/03/21 11:28:48 INFO cluster.YarnScheduler: Adding task set 0.0 with 1 tasks
19/03/21 11:28:48 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, datanode1, executor 2, partition 0, PROCESS_LOCAL, 7701 bytes)
19/03/21 11:28:48 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on datanode1:42432 (size: 71.0 KB, free: 366.2 MB)
19/03/21 11:29:43 INFO spark.ExecutorAllocationManager: Request to remove executorIds: 1, 3
19/03/21 11:29:43 INFO cluster.YarnClientSchedulerBackend: Requesting to kill executor(s) 1, 3
19/03/21 11:29:43 INFO cluster.YarnClientSchedulerBackend: Actual list of executor(s) to be killed is 1, 3
19/03/21 11:29:43 INFO spark.ExecutorAllocationManager: Removing executor 1 because it has been idle for 60 seconds (new desired total will be 2)
19/03/21 11:29:43 INFO spark.ExecutorAllocationManager: Removing executor 3 because it has been idle for 60 seconds (new desired total will be 1)
19/03/21 11:29:45 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 3.
19/03/21 11:29:45 INFO scheduler.DAGScheduler: Executor lost: 3 (epoch 0)
19/03/21 11:29:45 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 3 from BlockManagerMaster.
19/03/21 11:29:45 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(3, datanode2, 32853, None)
19/03/21 11:29:45 INFO storage.BlockManagerMaster: Removed 3 successfully in removeExecutor
19/03/21 11:29:45 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 1.
19/03/21 11:29:45 INFO scheduler.DAGScheduler: Executor lost: 1 (epoch 0)
19/03/21 11:29:45 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
19/03/21 11:29:45 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(1, datanode3, 39466, None)
19/03/21 11:29:45 INFO storage.BlockManagerMaster: Removed 1 successfully in removeExecutor
19/03/21 11:29:45 INFO cluster.YarnScheduler: Executor 3 on datanode2 killed by driver.
19/03/21 11:29:45 INFO cluster.YarnScheduler: Executor 1 on datanode3 killed by driver.
19/03/21 11:29:45 INFO spark.ExecutorAllocationManager: Existing executor 3 has been removed (new total is 2)
19/03/21 11:29:45 INFO spark.ExecutorAllocationManager: Existing executor 1 has been removed (new total is 1)
