Spark on Kubernetes: Is it possible to keep the crashed pods when a job fails? - apache-spark

I have the strange problem that a Spark job run on Kubernetes fails with a lot of "Missing an output location for shuffle X" errors in jobs where there is a lot of shuffling going on. Increasing executor memory does not help. The same job run on just a single node of the Kubernetes cluster in local[*] mode runs fine, however, so I suspect it has to do with Kubernetes or the underlying Docker.
When an executor dies, the pods are deleted immediately, so I cannot track down why it failed. Is there an option that keeps failed pods around so I can view their logs?

You can view the logs of the previous terminated pod like this:
kubectl logs -p <terminated pod name>
Also use the spec.ttlSecondsAfterFinished field of a Job, as mentioned here.
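For the ttlSecondsAfterFinished approach, a minimal sketch of a Job spec might look like this (name, image, and command are placeholders); the finished Job and its pods are kept for the given number of seconds before Kubernetes cleans them up:
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job              # placeholder name
spec:
  ttlSecondsAfterFinished: 600   # keep the finished Job and its pods for 10 minutes
  template:
    spec:
      containers:
      - name: main
        image: busybox           # placeholder image
        command: ["sh", "-c", "echo done"]
      restartPolicy: Never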

Executor pods are deleted by default on any failure, and you cannot do anything about that unless you customize the Spark on K8s code or use some advanced K8s tooling.
What you can do (and it is most probably the easiest approach to start with) is configure an external log collector, e.g. Grafana Loki, which can be deployed with one click to any K8s cluster, or some ELK stack components. These will help you persist logs even after the pods are deleted.
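As an illustration, one common way to get Loki onto a cluster is via its Helm chart; the repository and release names below are the usual ones but may differ depending on chart versions:
# add the Grafana chart repository and install the Loki stack (Loki + Promtail log shipper)
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install loki grafana/loki-stack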

There is a deleteOnTermination setting in the spark application yaml. See the spark-on-kubernetes README.md.
deleteOnTermination - (Optional)
DeleteOnTermination specifies whether executor pods should be deleted in case of failure or normal termination. Maps to spark.kubernetes.executor.deleteOnTermination, which is available since Spark 3.0.
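With Spark 3.0+ you can also pass the underlying property directly to spark-submit; a rough sketch (master URL and image are placeholders) could be:
spark-submit \
  --master k8s://https://<k8s-apiserver>:6443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=<your-spark-image> \
  --conf spark.kubernetes.executor.deleteOnTermination=false \
  ...
With deleteOnTermination set to false, failed executor pods stay around so you can inspect them with kubectl describe and kubectl logs.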

Related

Spark Executors PODS in Pending State on Kubernetes Deployment

I deployed a simple Spark application on Kubernetes with the following configuration:
spark.executor.instances=2
spark.executor.memory=8g
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.shuffleTracking.enabled=true
spark.executor.cores=2
spark.dynamicAllocation.minExecutors=2
spark.dynamicAllocation.maxExecutors=2
The memory requirements of the executor pods are more than what is available on the Kubernetes cluster, and because of this the Spark executor pods always stay in Pending state, as shown below.
$ kubectl get all
NAME READY STATUS RESTARTS AGE
pod/spark-k8sdemo-6e66d576f655b1f5-exec-1 0/1 Pending 0 10m
pod/spark-k8sdemo-6e66d576f655b1f5-exec-2 0/1 Pending 0 10m
pod/spark-master-6d9bc767c6-qsk8c 1/1 Running 0 10m
I know the reason is the non-availability of resources, as shown by the kubectl describe command:
$ kubectl describe pod/spark-k8sdemo-6e66d576f655b1f5-exec-1
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 28s (x12 over 12m) default-scheduler 0/1 nodes are available: 1 Insufficient cpu, 1 Insufficient memory.
On the other hand, the driver pod keeps waiting forever for the executor pods to get ample resources, as shown below.
$ kubectl logs pod/spark-master-6d9bc767c6-qsk8c
21/01/12 11:36:46 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
21/01/12 11:37:01 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
21/01/12 11:37:16 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Now my question is whether there is some way to make the driver wait only for some time/retries and, if the executors still don't get resources, have the driver pod auto-die with a proper message/log, e.g. "application aborted as there were no resources in cluster".
I went through all the Spark configurations for the above requirement but couldn't find any. In YARN we have spark.yarn.maxAppAttempts, but nothing similar was found for Kubernetes.
If no such configuration is available in Spark, is there a way in the Kubernetes pod definition to achieve the same?
This is Apache Spark 3.0.1 here. No idea if things get any different in the upcoming 3.1.1.
tl;dr I don't think there's built-in support for "make the driver wait only for some time/retries and, if the executors still don't get resources, have the driver pod auto-die with a proper message/log".
My very basic understanding of Spark on Kubernetes lets me claim that there is no such feature to "auto-die" the driver pod when there are no resources for executor pods.
There are podCreationTimeout (based on the spark.kubernetes.allocation.batch.delay configuration property) and the spark.kubernetes.executor.deleteOnTermination configuration property, which make Spark on Kubernetes delete executor pods that were requested but not created, but that's not really what you want.
Dynamic Allocation of Executors could make things a bit more complex, but it does not really matter in this case.
A workaround could be to use spark-submit --status to request the status of a Spark application, check whether it's up and running or not, and --kill it after a certain time threshold (you could achieve a similar thing using kubectl directly, too).
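A rough sketch of that workaround (the API server address is a placeholder; in cluster mode the submission ID has the form namespace:driver-pod-name and is printed by spark-submit):
# poll the state of the application
spark-submit --master k8s://https://<k8s-apiserver>:6443 --status "<namespace>:<driver-pod-name>"
# give up after your time threshold and kill it
spark-submit --master k8s://https://<k8s-apiserver>:6443 --kill "<namespace>:<driver-pod-name>"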
Just FYI, and to make things a bit more interesting: you should also review two other Spark on Kubernetes-specific configuration properties:
spark.kubernetes.driver.request.cores
spark.kubernetes.executor.request.cores
There could be others.
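A hedged sketch of how these might be passed (the values are purely illustrative; size them to what your nodes can actually offer):
spark-submit \
  --conf spark.kubernetes.driver.request.cores=0.5 \
  --conf spark.kubernetes.executor.request.cores=1 \
  --conf spark.executor.memory=4g \
  ...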
After a lot of digging, we were finally able to use a SparkListener to check whether the Spark application has started and an ample number of executors have been registered. If this condition is met, we proceed with the Spark jobs; otherwise we return with a warning stating that there are no ample resources in the Kubernetes cluster to run that Spark job.
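A rough Scala sketch of that idea (class and variable names are made up; the minimum executor count, timeout, and what to do on failure are up to you):
import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded}

// Counts executors as they register with the driver.
class ExecutorCountListener extends SparkListener {
  val registered = new AtomicInteger(0)
  override def onExecutorAdded(event: SparkListenerExecutorAdded): Unit = {
    registered.incrementAndGet()
  }
}

// Usage sketch: register the listener, wait up to a deadline for the minimum
// number of executors, and abort with a warning if they never show up.
// val listener = new ExecutorCountListener
// spark.sparkContext.addSparkListener(listener)
// val deadline = System.currentTimeMillis() + 120000L
// while (listener.registered.get() < 2 && System.currentTimeMillis() < deadline) Thread.sleep(5000)
// if (listener.registered.get() < 2) {
//   // "no ample resources in the Kubernetes cluster to run this Spark job"
//   spark.stop()
// }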
Is there a way in the Kubernetes pod definition to achieve the same?
You could use an init container in your Spark driver pod that confirms the Spark executor pods are available.
The init container could be a simple shell script. The logic could be written so it only tries so many times and/or times out.
In a pod definition, all init containers must succeed before continuing to any other containers in the pod.
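A heavily hedged sketch of that retry/timeout pattern for a pod spec; the image, labels, retry counts, and especially the readiness check itself are placeholders (the pod's service account would also need RBAC permission to list pods), so adapt the condition to whatever signal tells you the cluster has capacity:
initContainers:
- name: wait-for-capacity
  image: bitnami/kubectl          # placeholder image that provides kubectl
  command: ["/bin/sh", "-c"]
  args:
  - |
    # try up to 10 times, 30 seconds apart, then fail the pod instead of hanging forever
    for i in $(seq 1 10); do
      # placeholder check: replace with whatever confirms executors can be scheduled
      if kubectl get pods -l spark-role=executor 2>/dev/null | grep -q Running; then
        exit 0
      fi
      echo "attempt $i: executors not ready yet, retrying..."
      sleep 30
    done
    echo "application aborted as there were no resources in cluster"
    exit 1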

When running Spark on Kubernetes, is it possible to run as a user other than root?

When I submit a Spark job to Kubernetes, everything in the containers is run as root. Is it possible to run the job as another user?
When I submit the job in client mode, the driver runs as the user who submitted it, but the executors run as root, which might lead to file access problems when accessing files created by the executors.
Unless full customization of the K8s pod is supported by Spark on K8s (in particular the runAsUser feature), the only ways to control it (as I see it for the moment) are:
- Build a Docker image specifying USER in the Dockerfile (see the sketch after this list)
- Use some advanced K8s tools/controllers, e.g. Argo Events
- Customize spark-submit, or submit Spark pods directly as Kubernetes pods through the K8s APIs
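A minimal sketch of the Dockerfile route; the base image and UID are placeholders, and it assumes a Debian-based Spark image built with bin/docker-image-tool.sh (Alpine-based images would need addgroup/adduser instead):
FROM <your-registry>/spark:2.4.5
# create a dedicated non-root user and make it the default for driver and executor containers
RUN groupadd -g 1001 spark && useradd -u 1001 -g spark -m sparkuser
USER 1001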
Hope to see some improvements coming with Spark v3.0.0 soonish though.

What's the most elegant/right way to stop a spark job running on a Kubernetes cluster?

I'm new to Apache Spark and I'm trying to run a Spark job using spark-submit on my Kubernetes cluster. I was wondering if there's a right way to stop Spark jobs once the driver and executor pods are spawned? Would deleting the pods themselves be enough?
Thanks!
When you delete an executor, it will be recreated and the Spark application will keep working. However, if you delete the driver pod, it will stop the application.
So killing the driver pod is actually the way to stop a Spark application during execution.
As you are new to Spark and you want to run it on Kubernetes, you should check this tutorial.
At present the only way to stop a Spark job running on Kubernetes is to delete the driver pod (unless you have an app controlling the Spark context which is able to manipulate it). Since all other job-related resources are linked to the Spark driver pod with so-called ownerReferences, they will be removed automatically by Kubernetes.
It should also clean things up automatically when the job completes.
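In practice that boils down to something like the following (pod name and namespace are placeholders):
kubectl delete pod <spark-driver-pod-name> -n <namespace>
# executor pods and other owned resources are garbage-collected via their ownerReferences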

Troubleshooting kubernetes removed pod

I have a problem with a Spark application on Kubernetes. The Spark driver tries to create an executor pod and the executor pod fails to start. The problem is that as soon as the pod fails, the Spark driver removes it and creates a new one. The new one fails for the same reason. So, how can I recover logs from already removed pods, since this seems to be the default Spark behavior on Kubernetes? Also, I am not able to catch the pods since the removal is instantaneous. I have to wonder how I am ever supposed to fix the failing pod issue if I cannot recover the errors.
In your case it would be helpful to implement cluster logging. Even if a pod gets restarted or deleted, its logs will stay in the log aggregator's storage.
There is more than one solution for cluster logging, but the most popular is EFK (Elasticsearch, Fluentd, Kibana).
Actually, you can go even without Elasticsearch and Kibana.
Check out the excellent article Application Logging in Kubernetes with fluentd by Rosemary Wang, which explains how to configure fluentd to put aggregated logs on the fluentd pod's stdout and access them later using the command:
kubectl logs <fluentd pod>…

REST API for Spark 2.3 submit on Kubernetes (version 1.8.*) cluster

I'm using a Kubernetes cluster on AWS to run Spark jobs, with Spark 2.3. Now I want to run spark-submit from an AWS Lambda function against the K8s master. I would like to know if there is any REST interface to run spark-submit on the K8s master.
Unfortunately, it is not possible for Spark 2.3 if you are using the native Kubernetes support.
Based on the description in the deployment instructions, the submission process contains several steps:
Spark creates a Spark driver running within a Kubernetes pod.
The driver creates executors which are also running within Kubernetes pods
The driver connects to them, and executes application code
When the application completes, executor pods terminate and are cleaned up, but the driver pod persists its logs and remains in “completed” state in the Kubernetes API until it’s eventually garbage collected or manually cleaned up.
So, in fact, you have no place to submit a job to until you start the submission process, which launches the first Spark pod (the driver) for you. Only once the application completes is everything terminated.
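For reference, the native submission is driven entirely by spark-submit from the client side; a typical Spark 2.3 cluster-mode invocation looks roughly like this (API server address, image name, and jar path are placeholders):
bin/spark-submit \
  --master k8s://https://<k8s-apiserver-host>:<port> \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=<spark-image> \
  local:///path/to/examples.jar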
Please also see similar answer for this question under the link
