WARN - Running Spark Locally with Docker - Initial job has not accepted any resources - apache-spark

I launched a Spark master and a worker on my laptop using a Docker bridge network named spark:
docker network create spark
Then I started the master and the worker with the following commands:
docker run -ti -p 8080:8080 -p 7077:7077 -p 4040:4040 -e SPARK_NO_DAEMONIZE=true --network=spark --name spark-master apache/spark:v3.3.0 /opt/spark/sbin/start-master.sh
docker run -ti -p 8081:8081 -e SPARK_NO_DAEMONIZE=true --network=spark --name spark-worker apache/spark:v3.3.0 /opt/spark/sbin/start-worker.sh spark://<master>:7077
Once they have both started, I try to launch the following application code from my IDE (written in Kotlin, but the same applies if it is written in Java):
import org.apache.spark.api.java.function.MapFunction
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder()
    .appName("mapreduce")
    .master("spark://localhost:7077")
    .config("spark.dynamicAllocation.enabled", "false")
    .orCreate

var dataset: Dataset<String> = sparkSession.createDataset(
    listOf("Banana", "Car", "Glass", "Banana", "Computer", "Car"),
    Encoders.STRING())
// MapFunction is a Java SAM interface, so the type arguments are spelled out
dataset = dataset.map(MapFunction<String, String> { c -> "word: " + c }, Encoders.STRING())
dataset.show()
The code works if the master is local. I opened localhost:8080 and localhost:8081 and I can see the job getting registered. So why am I getting the following warning?
22/09/05 01:17:38 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20220905001717-0003/1 is now EXITED (Command exited with code 1)
22/09/05 01:17:38 INFO StandaloneSchedulerBackend: Executor app-20220905001717-0003/1 removed: Command exited with code 1
22/09/05 01:17:38 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
22/09/05 01:17:38 INFO BlockManagerMaster: Removal of executor 1 requested
22/09/05 01:17:38 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20220905001717-0003/2 on worker-20220905000436-172.18.0.3-8082 (172.18.0.3:8082) with 8 core(s)
22/09/05 01:17:38 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 1
22/09/05 01:17:38 INFO StandaloneSchedulerBackend: Granted executor ID app-20220905001717-0003/2 on hostPort 172.18.0.3:8082 with 8 core(s), 1024.0 MiB RAM
22/09/05 01:17:38 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20220905001717-0003/2 is now RUNNING
22/09/05 01:17:40 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
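The repeated EXITED (Command exited with code 1) lines suggest the executors start and then die almost immediately; one common cause in Docker setups (not confirmed here) is that the executors inside the containers cannot connect back to the driver running in the IDE on the host. Below is a minimal sketch of the driver-side settings that come into play in that situation; host.docker.internal and the fixed port numbers are assumptions for illustration, not taken from the setup above.

import org.apache.spark.sql.SparkSession

// Sketch only: make the driver reachable from executors running inside Docker.
// "host.docker.internal" resolves to the host on Docker Desktop; the ports are arbitrary examples.
val sparkSession = SparkSession.builder()
    .appName("mapreduce")
    .master("spark://localhost:7077")
    .config("spark.dynamicAllocation.enabled", "false")
    .config("spark.driver.host", "host.docker.internal") // address executors use to call back to the driver
    .config("spark.driver.bindAddress", "0.0.0.0")       // driver listens on all interfaces on the host
    .config("spark.driver.port", "35000")                // fixed ports instead of random ones
    .config("spark.blockManager.port", "36000")
    .orCreate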

Related

Can the driver program and the cluster manager (resource manager) be on the same machine in Spark standalone?

I'm doing spark-submit from the same machine as the Spark master, using the following command:
./bin/spark-submit --master spark://ip:port --deploy-mode "client" test.py
My application runs forever with the following kind of output:
22/11/18 13:17:37 INFO BlockManagerMaster: Removal of executor 8 requested
22/11/18 13:17:37 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 8
22/11/18 13:17:37 INFO StandaloneSchedulerBackend: Granted executor ID app-20221118131723-0008/10 on hostPort 192.168.210.94:37443 with 2 core(s), 1024.0 MiB RAM
22/11/18 13:17:37 INFO BlockManagerMasterEndpoint: Trying to remove executor 8 from BlockManagerMaster.
22/11/18 13:17:37 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20221118131723-0008/10 is now RUNNING
22/11/18 13:17:38 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20221118131723-0008/9 is now EXITED (Command exited with code 1)
22/11/18 13:17:38 INFO StandaloneSchedulerBackend: Executor app-20221118131723-0008/9 removed: Command exited with code 1
22/11/18 13:17:38 INFO BlockManagerMaster: Removal of executor 9 requested
22/11/18 13:17:38 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 9
22/11/18 13:17:38 INFO BlockManagerMasterEndpoint: Trying to remove executor 9 from BlockManagerMaster.
22/11/18 13:17:38 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20221118131723-0008/11 on worker-20221118111836-192.168.210.82-46395 (192.168.210.82:4639
But when I run it from other nodes, my application runs successfully. What could be the reason?

pyspark tasks stuck on Airflow and Spark Standalone Cluster with Docker-compose

I set up Airflow and a Spark standalone cluster with docker-compose.
Airflow runs spark-submit tasks in Spark client mode, which are submitted directly to the Spark master. However, when I execute a spark-submit task, it gets stuck.
Spark-submit command:
spark-submit --verbose --master spark:7077 --name dummy_sql_spark_job ${AIRFLOW_HOME}/dags/spark/spark_sql.py
What I see in the spark-submit driver logs:
22/01/04 07:02:19 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20220104070012-0011/1 is now EXITED (Command exited with code 1)
22/01/04 07:02:19 INFO StandaloneSchedulerBackend: Executor app-20220104070012-0011/1 removed: Command exited with code 1
22/01/04 07:02:19 INFO BlockManagerMaster: Removal of executor 1 requested
22/01/04 07:02:19 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 1
22/01/04 07:02:19 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
22/01/04 07:02:19 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20220104070012-0011/5 on worker-20220104061702-172.27.0.9-38453 (172.27.0.9:38453) with 1 core(s)
22/01/04 07:02:19 INFO StandaloneSchedulerBackend: Granted executor ID app-20220104070012-0011/5 on hostPort 172.27.0.9:38453 with 1 core(s), 1024.0 MiB RAM
22/01/04 07:02:19 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20220104070012-0011/5 is now RUNNING
22/01/04 07:02:28 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
22/01/04 07:02:43 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
22/01/04 07:02:58 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
22/01/04 07:03:13 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
22/01/04 07:03:28 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
22/01/04 07:03:43 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
What I see from one of the Spark workers:
spark-worker-1_1 | 22/01/04 07:02:18 INFO SecurityManager: Changing modify acls groups to:
spark-worker-1_1 | 22/01/04 07:02:18 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark); groups with view permissions: Set(); users with modify permissions: Set(spark); groups with modify permissions: Set()
spark-worker-1_1 | 22/01/04 07:02:19 INFO ExecutorRunner: Launch command: "/opt/bitnami/java/bin/java" "-cp" "/opt/bitnami/spark/conf/:/opt/bitnami/spark/jars/*" "-Xmx1024M" "-Dspark.driver.port=5001" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@172.27.0.6:5001" "--executor-id" "3" "--hostname" "172.27.0.11" "--cores" "1" "--app-id" "app-20220104070012-0011" "--worker-url" "spark://Worker@172.27.0.11:35093"
Versions:
Airflow image: apache/airflow:2.2.3
Spark driver version: 3.1.2
Spark server: 3.2.0
Network
All containers (airflow-scheduler, airflow-webserver, spark-master, spark-worker-n) are connected to the same external network.
The Spark driver is installed in the Airflow containers (scheduler, webserver), because the corresponding DAGs and tasks are executed by the airflow-scheduler.
UPDATE
After replacing the driver's Spark version to match the master's (3.2.0), and ensuring the same Python version (3.9.10) on both the driver and executor sides, the issue disappeared. So in my particular case the issue was not due to connectivity between the different Spark actors (driver, master, worker/executor), but due to a version mismatch. For some reason the Spark workers do not log a corresponding error, which is misleading.
Most of the threads I found pointed to connectivity issues; in my case, however, the cause was a mismatch between the Spark driver version and the master/worker version.
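A quick sanity check for this kind of mismatch (a sketch only, written here with the Kotlin/Java API used elsewhere on this page rather than PySpark) is to print the Spark version bundled with the driver application and compare it with the version shown on the master and worker web UIs:

import org.apache.spark.sql.SparkSession

// Sketch only: report the Spark library version that ships with the driver application.
// A local master is enough here; no cluster connection is needed just to read the version.
val driverSpark = SparkSession.builder()
    .appName("version-check")
    .master("local[1]")
    .orCreate
println("Driver-side Spark version: " + driverSpark.version()) // should match the master/worker version
driverSpark.stop()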

Spark Pod restarting every hour in Kubernetes

I have deployed Spark applications in cluster mode on Kubernetes. The Spark application pod gets restarted almost every hour.
The driver log has these messages before the restart:
20/07/11 13:34:02 ERROR TaskSchedulerImpl: Lost executor 1 on x.x.x.x: The executor with id 1 was deleted by a user or the framework.
20/07/11 13:34:02 ERROR TaskSchedulerImpl: Lost executor 2 on y.y.y.y: The executor with id 2 was deleted by a user or the framework.
20/07/11 13:34:02 INFO DAGScheduler: Executor lost: 1 (epoch 0)
20/07/11 13:34:02 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
20/07/11 13:34:02 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(1, x.x.x.x, 44879, None)
20/07/11 13:34:02 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor
20/07/11 13:34:02 INFO DAGScheduler: Shuffle files lost for executor: 1 (epoch 0)
20/07/11 13:34:02 INFO DAGScheduler: Executor lost: 2 (epoch 1)
20/07/11 13:34:02 INFO BlockManagerMasterEndpoint: Trying to remove executor 2 from BlockManagerMaster.
20/07/11 13:34:02 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(2, y.y.y.y, 46191, None)
20/07/11 13:34:02 INFO BlockManagerMaster: Removed 2 successfully in removeExecutor
20/07/11 13:34:02 INFO DAGScheduler: Shuffle files lost for executor: 2 (epoch 1)
20/07/11 13:34:02 INFO ExecutorPodsAllocator: Going to request 2 executors from Kubernetes.
20/07/11 13:34:16 INFO ExecutorPodsAllocator: Going to request 2 executors from Kubernetes.
And the Executor log has:
20/07/11 15:55:01 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
20/07/11 15:55:01 INFO MemoryStore: MemoryStore cleared
20/07/11 15:55:01 INFO BlockManager: BlockManager stopped
20/07/11 15:55:01 INFO ShutdownHookManager: Shutdown hook called
How can I find out what is causing the executors to be deleted?
Deployment:
Replicas: 1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 1 max unavailable, 0 max surge
Pod Template:
Labels: app=test
chart=test-2.0.0
heritage=Tiller
product=testp
release=test
service=test-spark
Containers:
test-spark:
Image: test-spark:2df66df06c
Port: <none>
Host Port: <none>
Command:
/spark/bin/start-spark.sh
Args:
while true; do sleep 30; done;
Limits:
memory: 4Gi
Requests:
memory: 4Gi
Liveness: exec [/spark/bin/liveness-probe.sh] delay=300s timeout=1s period=30s #success=1 #failure=10
Environment:
JVM_ARGS: -Xms256m -Xmx1g
KUBERNETES_MASTER: https://kubernetes.default.svc
KUBERNETES_NAMESPACE: test-spark
IMAGE_PULL_POLICY: Always
DRIVER_CPU: 1
DRIVER_MEMORY: 2048m
EXECUTOR_CPU: 1
EXECUTOR_MEMORY: 2048m
EXECUTOR_INSTANCES: 2
KAFKA_ADVERTISED_HOST_NAME: kafka.default:9092
ENRICH_KAFKA_ENRICHED_EVENTS_TOPICS: test-events
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing True NewReplicaSetAvailable
OldReplicaSets: <none>
NewReplicaSet: test-spark-5c5997b459 (1/1 replicas created)
Events: <none>
I did some quick research on running Spark on Kubernetes, and it seems that by design Spark terminates executor pods when they finish running Spark applications. Quoting the official Spark documentation:
When the application completes, the executor pods terminate and are cleaned up, but the driver pod persists logs and remains in “completed” state in the Kubernetes API until it’s eventually garbage collected or manually cleaned up.
Therefore, I believe there is nothing to worry about regarding the restarts, as long as your Spark instance still manages to start executor pods as and when required.
Reference: https://spark.apache.org/docs/2.4.5/running-on-kubernetes.html#how-it-works
I don't know how you configured your application pod, but you can stop the pod from restarting by including the following in your deployment YAML file, so that the pod is never restarted and you can debug it from there:
restartPolicy: Never

All executors finish with state KILLED and exitStatus 1

I am trying to set up a local Spark cluster. I am using Spark 2.4.4 on a Windows 10 machine.
To start the master and one worker, I run:
spark-class org.apache.spark.deploy.master.Master
spark-class org.apache.spark.deploy.worker.Worker spark://172.17.1.230:7077
After submitting an application to the cluster, it finishes successfully, but the Spark web UI says that the application is KILLED. That is also what I get from the worker logs. I have tried running my own examples and the examples included in the Spark installation. They all get killed with exitStatus 1.
To start the JavaSparkPi example from the Spark installation folder:
Spark> spark-submit --master spark://172.17.1.230:7077 --class org.apache.spark.examples.JavaSparkPi .\examples\jars\spark-examples_2.11-2.4.4.jar
Part of the log after the calculation finishes:
20/01/19 18:55:11 INFO DAGScheduler: Job 0 finished: reduce at JavaSparkPi.java:54, took 4.183853 s
Pi is roughly 3.13814
20/01/19 18:55:11 INFO SparkUI: Stopped Spark web UI at http://Nikola-PC:4040
20/01/19 18:55:11 INFO StandaloneSchedulerBackend: Shutting down all executors
20/01/19 18:55:11 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
20/01/19 18:55:11 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
20/01/19 18:55:11 WARN TransportChannelHandler: Exception in connection from /172.17.1.230:58560
java.io.IOException: An existing connection was forcibly closed by the remote host
The stderr log of the completed application outputs this at the end:
20/01/19 18:55:11 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 910 bytes result sent to driver
20/01/19 18:55:11 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 910 bytes result sent to driver
20/01/19 18:55:11 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
The worker log outputs
20/01/19 18:55:06 INFO ExecutorRunner: Launch command: "C:\Program Files\Java\jdk1.8.0_231\bin\java" "-cp" "C:\Users\nikol\Spark\bin\..\conf\;C:\Users\nikol\Spark\jars\*" "-Xmx1024M" "-Dspark.driver.port=58484" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@Nikola-PC:58484" "--executor-id" "0" "--hostname" "172.17.1.230" "--cores" "12" "--app-id" "app-20200119185506-0001" "--worker-url" "spark://Worker@172.17.1.230:58069"
20/01/19 18:55:11 INFO Worker: Asked to kill executor app-20200119185506-0001/0
20/01/19 18:55:11 INFO ExecutorRunner: Runner thread for executor app-20200119185506-0001/0 interrupted
20/01/19 18:55:11 INFO ExecutorRunner: Killing process!
20/01/19 18:55:11 INFO Worker: Executor app-20200119185506-0001/0 finished with state KILLED exitStatus 1
I have tried with Spark 2.4.4 for Hadoop 2.6 and 2.7. The problem remains in both cases.
This problem is the same as this one.

Executor finished with state KILLED exitStatus 1

After starting the master and a worker on a single machine...
spark-class org.apache.spark.deploy.master.Master -i 127.0.0.1 -p 7070
spark-class org.apache.spark.deploy.worker.Worker spark://127.0.0.1:7070
and submitting the following Spark job...
spark-submit --class Main --master spark://127.0.0.1:7070 --deploy-mode client /path/to/app.jar
the application executes successfully, but the executor is for some reason forcefully killed:
19/05/10 09:28:31 INFO Worker: Asked to kill executor app-20190510092810-0000/0
19/05/10 09:28:31 INFO ExecutorRunner: Runner thread for executor app-20190510092810-0000/0 interrupted
19/05/10 09:28:31 INFO ExecutorRunner: Killing process!
19/05/10 09:28:31 INFO Worker: Executor app-20190510092810-0000/0 finished with state KILLED exitStatus 1
Is this normal behavior? If not, how can I prevent this from happening?
I am using Spark 2.4.0.
