Spark Executor Pods in Pending State on Kubernetes Deployment - apache-spark

I deployed a simple Spark application on Kubernetes with the following configuration:
spark.executor.instances=2;spark.executor.memory=8g;
spark.dynamicAllocation.enabled=true;spark.dynamicAllocation.shuffleTracking.enabled=true;
spark.executor.cores=2;spark.dynamicAllocation.minExecutors=2;spark.dynamicAllocation.maxExecutors=2;
The memory requirements of the executor pods are more than what is available on the Kubernetes cluster, and because of this the Spark executor pods always stay in Pending state, as shown below.
$ kubectl get all
NAME READY STATUS RESTARTS AGE
pod/spark-k8sdemo-6e66d576f655b1f5-exec-1 0/1 Pending 0 10m
pod/spark-k8sdemo-6e66d576f655b1f5-exec-2 0/1 Pending 0 10m
pod/spark-master-6d9bc767c6-qsk8c 1/1 Running 0 10m
I know the reason is the non-availability of resources, as shown by the kubectl describe command:
$ kubectl describe pod/spark-k8sdemo-6e66d576f655b1f5-exec-1
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 28s (x12 over 12m) default-scheduler 0/1 nodes are available: 1 Insufficient cpu, 1 Insufficient memory.
On the other hand, the driver pod keeps waiting forever for the executor pods to get ample resources, as shown below.
$ kubectl logs pod/spark-master-6d9bc767c6-qsk8c
21/01/12 11:36:46 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
21/01/12 11:37:01 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
21/01/12 11:37:16 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Now my question is: is there some way to make the driver wait only for some time/retries, and if the executors still don't get resources, have the driver pod auto-die and print a proper message/log, e.g. "application aborted as there were no resources in cluster"?
I went through all the Spark configurations for the above requirement but couldn't find any. In YARN we have spark.yarn.maxAppAttempts, but nothing similar was found for Kubernetes.
If no such configuration is available in Spark, is there a way in the Kubernetes pod definition to achieve the same?

This is Apache Spark 3.0.1 here. No idea if things get any different in the upcoming 3.1.1.
tl;dr I don't think there's built-in support for "driver wait only for sometime/retries and if Executors still don't get resource, driver POD should auto-die with printing proper message/logs".
My very basic understanding of Spark on Kubernetes lets me claim that there is no such feature to "auto-die" the driver pod when there are no resources for executor pods.
There are podCreationTimeout (based on the spark.kubernetes.allocation.batch.delay configuration property) and the spark.kubernetes.executor.deleteOnTermination configuration property, which make Spark on Kubernetes delete executor pods that were requested but not created, but that's not really what you want.
Dynamic Allocation of Executors could make things a bit more complex, but it does not really matter in this case.
A workaround could be to use spark-submit --status to request the status of a Spark application, check whether it's up and running or not, and --kill it after a certain time threshold (you could achieve a similar thing using kubectl directly too).
Just an FYI, and to make things a bit more interesting: you should also review two other Spark on Kubernetes-specific configuration properties (a short sketch of how they could be set follows the list):
spark.kubernetes.driver.request.cores
spark.kubernetes.executor.request.cores
There could be others.
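For illustration, a minimal Scala sketch of setting such properties when building the session might look like this; the values below are made up for the example, and note that spark.kubernetes.driver.request.cores itself only takes effect when the driver pod is created, i.e. at spark-submit time in cluster mode:
import org.apache.spark.sql.SparkSession

// Hypothetical sizing values; pick requests the cluster can actually schedule.
val spark = SparkSession.builder()
  .appName("k8s-sizing-sketch")
  .config("spark.kubernetes.executor.request.cores", "500m") // K8s CPU request, decoupled from spark.executor.cores
  .config("spark.executor.memory", "2g")                     // smaller than the 8g that could not be scheduled
  .config("spark.executor.instances", "2")
  .getOrCreate()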

After a lot of digging, we were finally able to use a SparkListener to check whether the Spark application has started and an ample number of executors have been registered. If this condition is met, we proceed with the Spark jobs; otherwise we return with a warning stating that there are not ample resources in the Kubernetes cluster to run that Spark job.
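For reference, here is a rough Scala sketch of that approach; the object name, timeout, and polling interval are illustrative, not our exact code:
import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded}
import org.apache.spark.sql.SparkSession

object ExecutorGate {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("executor-gate").getOrCreate()
    val sc = spark.sparkContext

    // Count executor registrations as they come in.
    val registered = new AtomicInteger(0)
    sc.addSparkListener(new SparkListener {
      override def onExecutorAdded(event: SparkListenerExecutorAdded): Unit =
        registered.incrementAndGet()
    })

    val required = sc.getConf.getInt("spark.dynamicAllocation.minExecutors", 2)
    val deadline = System.currentTimeMillis() + 5 * 60 * 1000   // wait up to 5 minutes

    while (registered.get() < required && System.currentTimeMillis() < deadline)
      Thread.sleep(10000)

    if (registered.get() < required) {
      // Not enough executors: warn and abort instead of hanging forever.
      sc.cancelAllJobs()
      spark.stop()
      sys.error(s"Aborting: only ${registered.get()} of $required executors registered; " +
        "not ample resources in the Kubernetes cluster to run this Spark job.")
    }

    // ... proceed with the actual Spark jobs here ...
  }
}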

Is there a way in the Kubernetes pod definition to achieve the same?
You could use an init container in your Spark driver pods that confirms the Spark executor pods are available.
The init container could be a simple shell script. The logic could be written so that it only retries a limited number of times and/or times out.
In a pod definition, all init containers must succeed before any other containers in the pod are started.

Related

Spark Streaming on Kubernetes - Executor Pods not Restarted/Rescheduled

We are running DStream applications on a Kubernetes cluster using the Spark Operator (Spark 2.4.7). Sometimes, due to various reasons (OOMs, Kubernetes node restarts), executor pods get lost, and while Spark often sees this and schedules a new executor, eventually (after a week or more) most of the applications get into a state where the executors are not rescheduled and the application keeps running with fewer executors than requested. In the Spark UI those "forever lost" executors are shown as healthy, but obviously they aren't fetching any data from Kafka. The only way to make sure the application works as expected is to recreate the SparkApplication CRD, which basically means a hard restart.
You can find the restart policy section of SparkApplication CRD below:
restartPolicy:
  onFailureRetries: 100
  onFailureRetryInterval: 20
  onSubmissionFailureRetries: 5
  onSubmissionFailureRetryInterval: 30
  type: Always

Spark job on Kubernetes Under Resource Starvation Wait Indefinitely For SPARK_MIN_EXECUTORS

I am using Spark 3.0.1 and working on a Spark deployment on Kubernetes, where Kubernetes acts as the cluster manager for the Spark job and Spark submits the job using client mode. If the cluster does not have sufficient resources (CPU/memory) for the minimum number of executors, the executors stay in Pending state indefinitely until resources get freed.
Suppose the cluster configuration is:
total memory = 204Gi
used memory = 200Gi
free memory = 4Gi
spark.executor.memory=10g
spark.dynamicAllocation.minExecutors=4
spark.dynamicAllocation.maxExecutors=8
Here the job should not be submitted, as the executors that can be allocated are fewer than minExecutors.
How can the driver abort the job in this scenario?
First, I would like to mention that Spark dynamic allocation is not supported for Kubernetes yet (as of version 3.0.1); it is in the pipeline for a future release (Link).
As for the requirement you posted, you could address it by running a resource-monitor code snippet before the job is initialized and terminating the initialization pod itself with an error; a rough sketch of that idea follows below.
If you want to do this from the CLI, you could use kubectl describe nodes or the kube-capacity utility to monitor the resources.
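For illustration, a rough Scala sketch of such a pre-flight resource monitor could look like the following. It assumes the fabric8 kubernetes-client library (the client Spark's own Kubernetes backend uses); the object name and thresholds are made up, and it only compares total allocatable memory across nodes, ignoring what other pods have already requested:
import io.fabric8.kubernetes.api.model.Quantity
import io.fabric8.kubernetes.client.DefaultKubernetesClient
import scala.collection.JavaConverters._

object PreflightResourceCheck {
  def main(args: Array[String]): Unit = {
    val minExecutors     = 4                        // spark.dynamicAllocation.minExecutors
    val executorMemBytes = 10L * 1024 * 1024 * 1024 // spark.executor.memory=10g

    val client = new DefaultKubernetesClient()
    try {
      // Sum the allocatable memory reported by every node (a coarse upper bound:
      // it does not subtract what other pods have already requested).
      val allocatable = client.nodes().list().getItems.asScala
        .map(node => Quantity.getAmountInBytes(node.getStatus.getAllocatable.get("memory")).longValue())
        .sum

      if (allocatable < minExecutors.toLong * executorMemBytes) {
        System.err.println(s"Aborting: cluster cannot fit $minExecutors executors of 10g each.")
        sys.exit(1) // non-zero exit marks the initialization pod as failed
      }
    } finally client.close()
  }
}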

Spark on Kubernetes: Is it possible to keep the crashed pods when a job fails?

I have the strange problem that a Spark job run on Kubernetes fails with a lot of "Missing an output location for shuffle X" errors in jobs where there is a lot of shuffling going on. Increasing executor memory does not help. However, the same job runs fine on just a single node of the Kubernetes cluster in local[*] mode, so I suspect it has to do with Kubernetes or the underlying Docker.
When an executor dies, its pod is deleted immediately, so I cannot track down why it failed. Is there an option that keeps failed pods around so I can view their logs?
You can view the logs of the previous terminated pod like this:
kubectl logs -p <terminated pod name>
Also, use the spec.ttlSecondsAfterFinished field of a Job, as mentioned here.
Executor pods are deleted by default on any failure, and you cannot do anything about that unless you customize the Spark on K8s code or use some advanced K8s tooling.
What you can do (and what is most probably the easiest approach to start with) is configure an external log collector, e.g. Grafana Loki (which can be deployed with one click to any K8s cluster) or some ELK stack components. These will help you persist logs even after the pods are deleted.
There is a deleteOnTermination setting in the Spark application YAML. See the spark-on-kubernetes README.md (a small sketch of the plain-Spark equivalent follows the quoted description).
deleteOnTermination - (Optional)
DeleteOnTermination specifies whether executor pods should be deleted in case of failure or normal termination. Maps to spark.kubernetes.executor.deleteOnTermination, which is available since Spark 3.0.
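Outside the operator YAML, a minimal Scala sketch of the plain-Spark equivalent could look like this (the property name is the real one quoted above; the rest is illustrative):
import org.apache.spark.sql.SparkSession

// Keep executor pods around after failure so their logs stay available
// via kubectl logs; the default is true (pods are deleted).
val spark = SparkSession.builder()
  .appName("keep-failed-executor-pods")
  .config("spark.kubernetes.executor.deleteOnTermination", "false")
  .getOrCreate()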

Spark driver pod eviction Kubernetes

What would be the recommended approach to let the Spark driver pod complete the currently running job before it gets evicted to a new node while maintenance (kernel upgrade, hardware maintenance, etc.) is going on on the current node using the drain command?
I don't think I can use a PodDisruptionBudget, as the Spark pods' deployment YAML(s) are handled by Kubernetes.

How can I run an Apache Spark shell remotely?

I have a Spark cluster set up with one master and 3 workers. I also have Spark installed on a CentOS VM. I'm trying to run a Spark shell from my local VM which would connect to the master and allow me to execute simple Scala code. So, here is the command I run on my local VM:
bin/spark-shell --master spark://spark01:7077
The shell runs to the point where I can enter Scala code. It says that executors have been granted (x3 - one for each worker). If I peek at the Master's UI, I can see one running application, Spark shell. All the workers are ALIVE, have 2 / 2 cores used, and have allocated 512 MB (out of 5 GB) to the application. So, I try to execute the following Scala code:
sc.parallelize(1 to 100).count
Unfortunately, the command doesn't work. The shell will just print the same warning endlessly:
INFO SparkContext: Starting job: count at <console>:13
INFO DAGScheduler: Got job 0 (count at <console>:13) with 2 output partitions (allowLocal=false)
INFO DAGScheduler: Final stage: Stage 0(count at <console>:13) with 2 output partitions (allowLocal=false)
INFO DAGScheduler: Parents of final stage: List()
INFO DAGScheduler: Missing parents: List()
INFO DAGScheduler: Submitting Stage 0 (ParallelCollectionRDD[0] at parallelize at <console>:13), which has no missing parents
INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (ParallelCollectionRDD[0] at parallelize at <console>:13)
INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
Following my research into the issue, I have confirmed that the master URL I am using is identical to the one on the web UI. I can ping and ssh both ways (cluster to local VM, and vice-versa). Moreover, I have played with the executor-memory parameter (both increasing and decreasing the memory) to no avail. Finally, I tried disabling the firewall (iptables) on both sides, but I keep getting the same error. I am using Spark 1.0.2.
TL;DR Is it possible to run an Apache Spark shell remotely (and inherently submit applications remotely)? If so, what am I missing?
EDIT: I took a look at the worker logs and found that the workers had trouble finding Spark:
ERROR org.apache.spark.deploy.worker.ExecutorRunner: Error running executor
java.io.IOException: Cannot run program "/usr/bin/spark-1.0.2/bin/compute-classpath.sh" (in directory "."): error=2, No such file or directory
...
Spark is installed in a different directory on my local VM than on the cluster. The path the worker is attempting to find is the one on my local VM. Is there a way for me to specify this path? Or must they be identical everywhere?
For the moment, I adjusted my directories to circumvent this error. Now, my Spark Shell fails before I get the chance to enter the count command (Master removed our application: FAILED). All the workers have the same error:
ERROR akka.remote.EndpointWriter: AssociationError [akka.tcp://sparkWorker@spark02:7078] -> [akka.tcp://sparkExecutor@spark02:53633]:
Error [Association failed with [akka.tcp://sparkExecutor@spark02:53633]]
[akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor@spark02:53633]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$annon2: Connection refused: spark02/192.168.64.2:53633
As suspected, I am running into network issues. What should I look at now?
I solved this problem on my Spark client and Spark cluster.
Check your network: client A and the cluster must be able to ping each other. Then add two config lines to your spark-env.sh on client A.
First:
export SPARK_MASTER_IP=172.100.102.156
export SPARK_JAR=/usr/spark-1.1.0-bin-hadoop2.4/lib/spark-assembly-1.1.0-hadoop2.4.0.jar
Second:
Test your Spark shell in cluster mode!
This problem can be caused by the network configuration. It looks like the error TaskSchedulerImpl: Initial job has not accepted any resources can have quite a few causes (see also this answer):
actual resource shortage
broken communication between master and workers
broken communication between master/workers and driver
The easiest way to exclude the first possibility is to run a test with a Spark shell running directly on the master. If this works, communication within the cluster itself is fine and the problem is caused by the communication to the driver host. To analyze the problem further, it helps to look into the worker logs, which contain entries like
16/08/14 09:21:52 INFO ExecutorRunner: Launch command:
"/usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java"
...
"--driver-url" "spark://CoarseGrainedScheduler#192.168.1.228:37752"
...
and test whether the worker can establish a connection to the driver's IP/port. Apart from general firewall / port forwarding issues, it might be possible that the driver is binding to the wrong network interface. In this case you can export SPARK_LOCAL_IP on the driver before starting the Spark shell in order to bind to a different interface.
Some additional references:
Knowledge base entry on network connectivity issues.
Github discussion on improving the documentation of Initial job has not accepted any resources.

Resources