spark-submit runs infinitely - displayed errors: Removal of executor n requested - Asked to remove non-existent executor n+1 - apache-spark

I have deployed a spark standalone cluster with one driver and 2 executors, each one running on a separate machine.
Whenever I submit a job to the master using spark-submit --master spark://driver_ip:7077 example/src/main/python/pi.py, It runs infinitely and displays these errors:
BlockManagerMaster:54 - Removal of executor 50 requested
CoarseGrainedSchedulerBackend$DriverEndpoint:54 - Asked to remove non-existent executor 50
BlockManagerMasterEndpoint:54 - Trying to remove executor 50 from BlockManagerMaster.
StandaloneAppClient$ClientEndpoint:54 - Executor updated: app-20181129123913-0003/52 is now RUNNING
StandaloneAppClient$ClientEndpoint:54 - Executor updated: app-20181129123913-0003/51 is now EXITED (Command exited with code 1)
StandaloneSchedulerBackend:54 - Executor app-20181129123913-0003/51 removed: Command exited with code 1
StandaloneAppClient$ClientEndpoint:54 - Executor added: app-20181129123913-0003/53 on worker-20181129120029-10.0.1.101-36599 (10.0.1.101:36599) with 1 core(s)
Each time the number in Removal of executor increments and the program doesn't end. It looks like the executors are constantly refusing the jobs.
Could anyone help me figure out the issue.
Note that I can see that the Spark executors are registered with the Driver In the Spark Manager's web UI.

Related

pyspark tasks stuck on Airflow and Spark Standalone Cluster with Docker-compose

I setup Airflow and Spark standalone cluster on docker-compose.
Airflow run spark-submit tasks via spark client mode, which are submitted directly to spark master. However when I execute spark-submit task, the task got stuck.
Spark-submit Command:
spark-submit --verbose --master spark:7077 --name dummy_sql_spark_job ${AIRFLOW_HOME}/dags/spark/spark_sql.py
What i see from spark-submit driver logs:
22/01/04 07:02:19 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20220104070012-0011/1 is now EXITED (Command exited with code 1)
22/01/04 07:02:19 INFO StandaloneSchedulerBackend: Executor app-20220104070012-0011/1 removed: Command exited with code 1
22/01/04 07:02:19 INFO BlockManagerMaster: Removal of executor 1 requested
22/01/04 07:02:19 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 1
22/01/04 07:02:19 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
22/01/04 07:02:19 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20220104070012-0011/5 on worker-20220104061702-172.27.0.9-38453 (172.27.0.9:38453) with 1 core(s)
22/01/04 07:02:19 INFO StandaloneSchedulerBackend: Granted executor ID app-20220104070012-0011/5 on hostPort 172.27.0.9:38453 with 1 core(s), 1024.0 MiB RAM
22/01/04 07:02:19 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20220104070012-0011/5 is now RUNNING
22/01/04 07:02:28 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
22/01/04 07:02:43 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
22/01/04 07:02:58 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
22/01/04 07:03:13 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
22/01/04 07:03:28 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
22/01/04 07:03:43 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
What i see from one of the spark workers:
spark-worker-1_1 | 22/01/04 07:02:18 INFO SecurityManager: Changing modify acls groups to:
spark-worker-1_1 | 22/01/04 07:02:18 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark); groups with view permissions: Set(); users with modify permissions: Set(spark); groups with modify permissions: Set()
spark-worker-1_1 | 22/01/04 07:02:19 INFO ExecutorRunner: Launch command: "/opt/bitnami/java/bin/java" "-cp" "/opt/bitnami/spark/conf/:/opt/bitnami/spark/jars/*" "-Xmx1024M" "-Dspark.driver.port=5001" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler#172.27.0.6:5001" "--executor-id" "3" "--hostname" "172.27.0.11" "--cores" "1" "--app-id" "app-20220104070012-0011" "--worker-url" "spark://Worker#172.27.0.11:35093"
Versions:
Airflow image: apache/airflow:2.2.3
Spark driver version: 3.1.2
Spark server: 3.2.0
Network
All containers airflow-scheduler, airflow-webserver, spark-master, spark-worker-n connected to same external network.
spark-driver is installed under airflow containers (scheduler, webserver), because corresponding dags and tasks are executed by airflow-scheduler.
UPDATE
After replacing driver spark version to match the master's one 3.2.0, the issue get disappeared. So it means, that in my particular case the issue was not due to connectivity between different spark actors (driver, master, worker/executor), but due to version mismatch. For some reason spark workers does not log corresponding error, which is misleading.
Most of the threads was pointing to connectivity issues. However in my case issue was due to mismatch of spark's driver vs master/worker version.
After replacing driver spark version to match the master's one 3.2.0, as well as ensure the same python version both on driver and executor sides (3.9.10) the issue get disappeared. So it means, that in my particular case the issue was not due to connectivity between different spark actors (driver, master, worker/executor), but due to version mismatch. For some reason spark workers does not log corresponding error, which is misleading.

Application failed 2 times due to AM Container, exited with exitcode -104

I am running a Spark application with two input files and a jar file which is taken up from Amazon S3 bucket. I am creating a cluster using AWS CLI with instance type as m5.12xlarge and instance-count as 11 and spark properties as:
--deploy-mode cluster
--num-executors 10
--executor-cores 45
--executor-memory 155g
My spark job was running for some time and then it failed and restarted automatically and it ran again for some time and then it showed this diagnostics (pulled from the logs)
diagnostics: Application application_1557259242251_0001 failed 2 times due to AM Container for appattempt_1557259242251_0001_000002 exited with exitCode: -104
Failing this attempt.Diagnostics: Container [pid=11779,containerID=container_1557259242251_0001_02_000001] is running beyond physical memory limits. Current usage: 1.4 GB of 1.4 GB physical memory used; 3.5 GB of 6.9 GB virtual memory used. Killing container.
Dump of the process-tree for container_1557259242251_0001_02_000001 :
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Exception in thread "main" org.apache.spark.SparkException: Application application_1557259242251_0001 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1165)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1520)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
19/05/07 20:03:35 INFO ShutdownHookManager: Shutdown hook called
19/05/07 20:03:35 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-3deea823-45e5-4a11-a5ff-833b01e6ae79
19/05/07 20:03:35 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-d6c3f8b2-34c6-422b-b946-ad03b1ee77d6
Command exiting with ret '1'
I am not able to figure out what is the problem?
I have tried change the instance type or lowering the executor memory and executor-cores but still the same problem keep on occuring.
Sometimes the same configuration settings terminates the cluster successfully and results are generated but many time these error are generated.
Can someone please help?
If you are providing more than 1 input file to the spark job. Make a jar and then execute it.
Step 1: How to make a zip file
zip abc.zip file1.py file2.py
Step 2: Execute job with a zip file
spark2-submit --master yarn --deploy-mode cluster --py-files /home/abc.zip /home/main_program_file.py

Why am I getting "Removing worker because we got no heartbeat in 60 seconds" on Spark master

I think I might of stumbled across a bug and wanted to get other people's input. I am running a pyspark application using Spark 2.2.0 in standalone mode. I am doing a somewhat heavy transformation in python inside a flatMap and the driver keeps killing the workers.
Here is what am I seeing:
The master after 60s of not seeing any heartbeat message from the workers it prints out this message to the log:
Removing worker [worker name] because we got no heartbeat in 60
seconds
Removing worker [worker name] on [IP]:[port]
Telling app of
lost executor: [executor number]
I then see in the driver log the following message:
Lost executor [executor number] on [executor IP]: worker lost
The worker then terminates and I see this message in its log:
Driver commanded a shutdown
I have looked at the Spark source code and from what I can tell, as long as the executor is alive it should send a heartbeat message back as it is using a ThreadUtils.newDaemonSingleThreadScheduledExecutor to do this.
One other thing that I noticed while I was running top on one of the workers, is that the executor JVM seems to be suspended throughout this process. There are as many python processes as I specified in the SPARK_WORKER_CORES env variable and each is consuming close to 100% of the CPU.
Anyone have any thoughts on this?
I was facing this same issue, increasing interval worked.
Excerpt from Logs start-all.sh logs
INFO Utils: Successfully started service 'sparkMaster' on port 7077.
INFO Master: Starting Spark master at spark://master:7077
INFO Master: Running Spark version 3.0.1
INFO Utils: Successfully started service 'MasterUI' on port 8080.
INFO MasterWebUI: Bound MasterWebUI to 0.0.0.0, and started at http://master:8080
INFO Master: I have been elected leader! New state: ALIVE
INFO Master: Registering worker slave01:41191 with 16 cores, 15.7 GiB RAM
INFO Master: Registering worker slave02:37853 with 16 cores, 15.7 GiB RAM
WARN Master: Removing worker-20210618205117-slave01-41191 because we got no heartbeat in 60 seconds
INFO Master: Removing worker worker-20210618205117-slave01-41191 on slave01:41191
INFO Master: Telling app of lost worker: worker-20210618205117-slave01-41191
WARN Master: Removing worker-20210618204723-slave02-37853 because we got no heartbeat in 60 seconds
INFO Master: Removing worker worker-20210618204723-slave02-37853 on slave02:37853
INFO Master: Telling app of lost worker: worker-20210618204723-slave02-37853
WARN Master: Got heartbeat from unregistered worker worker-20210618205117-slave01-41191. This worker was never registered, so ignoring the heartbeat.
WARN Master: Got heartbeat from unregistered worker worker-20210618204723-slave02-37853. This worker was never registered, so ignoring the heartbeat.
Solution: add following configs to $SPARK_HOME/conf/spark-defaults.conf
spark.network.timeout 50000
spark.executor.heartbeatInterval 5000
spark.worker.timeout 5000

spark tasks fail with error, showing exit status: -100

The spark job running in yarn mode, shows few tasks failed with following reason:
ExecutorLostFailure (executor 36 exited caused by one of the running tasks) Reason: Container marked as failed: container_xxxxxxxxxx_yyyy_01_000054 on host: ip-xxx-yy-zzz-zz. Exit status: -100. Diagnostics: Container released on a *lost* node
Any idea why is this happening?
There are two main reasons.
It is may because of your memoryOverhead needed by the yarn container is not enough, and the solution is to Increase the spark.executor.memoryOverhead
Possibly, it is because the slave node disk lack space to write tmp data required by spark. check your yarn usercache dir (for EMR, it locates on /mnt/yarn/usercache/),
or type df -h to check your disk remaining space.
Container killed by the framework, either due to being released by the application or being 'lost' due to node failures etc. have a special exit code of -100.
The node failure could be because of not having enough disc space or executor memory.
I understand your cluster is not on AWS but as AWS manager the MR cluster they have released an FAQ
For Glue job: https://aws.amazon.com/premiumsupport/knowledge-center/container-released-lost-node-100-glue/
For EMR: https://aws.amazon.com/premiumsupport/knowledge-center/emr-exit-status-100-lost-node/

Spark streaming job fails after getting stopped by Driver

I have a spark streaming job which reads in data from Kafka and does some operations on it. I am running the job over a yarn cluster, Spark 1.4.1, which has two nodes with 16 GB RAM each and 16 cores each.
I have these conf passed to the spark-submit job :
--master yarn-cluster --num-executors 3 --driver-memory 4g --executor-memory 2g --executor-cores 3
The job returns this error and finishes after running for a short while :
INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 11,
(reason: Max number of executor failures reached)
.....
ERROR scheduler.ReceiverTracker: Deregistered receiver for stream 0:
Stopped by driver
Updated :
These logs were found too :
INFO yarn.YarnAllocator: Received 3 containers from YARN, launching executors on 3 of them.....
INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down.
....
INFO yarn.YarnAllocator: Received 2 containers from YARN, launching executors on 2 of them.
INFO yarn.ExecutorRunnable: Starting Executor Container.....
INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down...
INFO yarn.YarnAllocator: Completed container container_e10_1453801197604_0104_01_000006 (state: COMPLETE, exit status: 1)
INFO yarn.YarnAllocator: Container marked as failed: container_e10_1453801197604_0104_01_000006. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_e10_1453801197604_0104_01_000006
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:576)
at org.apache.hadoop.util.Shell.run(Shell.java:487)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:753)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
What might be the reasons for this? Appreciate some help.
Thanks
can you please show your scala/java code that is reading from kafka? I suspect you probably not creating your SparkConf correctly.
Try something like
SparkConf sparkConf = new SparkConf().setAppName("ApplicationName");
also try running application in yarn-client mode and share the output.
I got the same issue. and I have found 1 solution to fix the issue by removing sparkContext.stop() at the end of main function, leave the stop action for GC.
Spark team has resolved the issue in Spark core, however, the fix has just been master branch so far. We need to wait until the fix has been updated into the new release.
https://issues.apache.org/jira/browse/SPARK-12009

Resources