Troubleshooting Kubernetes removed pods - apache-spark

I have a problem with a Spark application on Kubernetes. The Spark driver tries to create an executor pod, and the executor pod fails to start. As soon as the pod fails, the driver removes it and creates a new one, which fails for the same reason. How can I recover logs from pods that have already been removed, since deleting them seems to be the default Spark behavior on Kubernetes? I am also unable to catch the pods, since the removal is instantaneous. I do not see how I am ever supposed to fix the failing pods if I cannot recover the errors.

In your case it would help to set up cluster-level logging. Even if a pod is restarted or deleted, its logs will stay in the log aggregator's storage.
There is more than one solution for cluster logging, but the most popular is EFK (Elasticsearch, Fluentd, Kibana).
Actually, you can even go without Elasticsearch and Kibana.
Check out the excellent article "Application Logging in Kubernetes with fluentd" by Rosemary Wang, which explains how to configure Fluentd to send the aggregated logs to the Fluentd pod's stdout so you can read them later with the command:
kubectl logs <fluentd pod>…
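A minimal sketch of that approach, assuming Fluentd is deployed as a DaemonSet in a logging namespace with the label app=fluentd (the manifest name, namespace, and label are illustrative, not from the article):
# deploy Fluentd so it tails every container's log files on each node
kubectl apply -n logging -f fluentd-daemonset.yaml
# later, search the aggregated stream for output from executor pods
# that Spark has already deleted
kubectl logs -n logging -l app=fluentd | grep <executor pod name>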

Related

Running Spark in Kubernetes client mode, if the executor ConfigMap creation fails, is there a way to recover

If the call to create the Kubernetes spark-conf-volume-exec ConfigMap fails (see https://github.com/apache/spark/blob/02a055a42de5597cd42c1c0d4470f0e769571dc3/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala#L80), is there a way to recover so that the executors will still start?
If the ConfigMap is not successfully created on the first try, the driver keeps running but never attempts to create the ConfigMap again, and all executors fail to start because the spark-conf-volume-exec ConfigMap does not exist (see https://github.com/apache/spark/blob/02a055a42de5597cd42c1c0d4470f0e769571dc3/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStep.scala#L98).
Alternatively, is it possible to create the ConfigMap before starting the Spark driver? In Spark 3.2.1, spark.kubernetes.executor.disableConfigMap was added, but since KubernetesClientUtils.configMapNameExecutor is randomly generated each time, I do not see a clear way to use that option.
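A quick way to confirm this situation (a diagnostic sketch; the namespace placeholder and the spark-exec name prefix are assumptions based on reading KubernetesClientUtils in the linked source):
# check whether the driver managed to create the executor ConfigMap at all
kubectl get configmaps -n <spark namespace> | grep spark-exec
# if it is missing, the driver log should show the failed creation attempt
kubectl logs <driver pod name> -n <spark namespace> | grep -i configmap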

Spark on Kubernetes: Is it possible to keep the crashed pods when a job fails?

I have the strange problem that a Spark job run on Kubernetes fails with a lot of "Missing an output location for shuffle X" errors in jobs where there is a lot of shuffling going on. Increasing executor memory does not help. The same job runs fine on just a single node of the Kubernetes cluster in local[*] mode, however, so I suspect it has to do with Kubernetes or the underlying Docker.
When an executor dies, its pod is deleted immediately, so I cannot track down why it failed. Is there an option that keeps failed pods around so I can view their logs?
You can view the logs of the previously terminated pod like this:
kubectl logs -p <terminated pod name>
Also consider the spec.ttlSecondsAfterFinished field of a Job, as mentioned here.
Executor pods are deleted on any failure by default, and there is little you can do about that unless you customize the Spark-on-Kubernetes code or use some advanced Kubernetes tooling.
What you can do (and it is probably the easiest approach to start with) is configure an external log collector, e.g. Grafana Loki, which can be deployed to any Kubernetes cluster with one click, or some ELK stack components. These will persist logs even after the pods are deleted.
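For example, a sketch of installing Loki together with its Promtail collector from the grafana Helm repository (the release name and namespace are arbitrary choices):
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
# Promtail ships every pod's stdout/stderr to Loki, so executor logs outlive the pods
helm install loki grafana/loki-stack \
  --namespace logging --create-namespace \
  --set promtail.enabled=true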
There is a deleteOnTermination setting in the Spark application YAML. See the spark-on-kubernetes README.md.
deleteOnTermination - (Optional)
DeleteOnTermination specifies whether executor pods should be deleted in case of failure or normal termination. It maps to spark.kubernetes.executor.deleteOnTermination, which has been available since Spark 3.0.
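The same behaviour can also be set as a plain Spark configuration when submitting with spark-submit (a sketch; the master URL, image, class, and jar are placeholders):
spark-submit \
  --master k8s://https://<api-server>:<port> \
  --deploy-mode cluster \
  --class <main class> \
  --conf spark.kubernetes.container.image=<spark image> \
  --conf spark.kubernetes.executor.deleteOnTermination=false \
  <application jar>
# failed executor pods are now kept around in the Error state, so their logs stay readable:
kubectl logs <failed executor pod name>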

Get Dataproc Logs to Stackdriver Logging

I am running Dataproc and submitting Spark Jobs using the default client-mode.
The logs for the jobs are visible in the GCP console and are available in the GCS bucket. However, I would like to see the logs in Stackdriver Logging.
Currently, the only way I have found is to use cluster mode instead.
Is there a way to push logs to Stackdriver when using client mode?
This is something the Dataproc team is actively working on and should have a solution for sometime soon. If you want to file a public feature request to track this, that is an option, but I will try to update this response when the feature is available to you.
Digging into it a bit, the reason why you can see the logs when using cluster-mode is that we have Fluentd configurations that pick up YARN container logs (userlogs) by default. When running in cluster-mode the driver runs in a YARN container and those logs are picked up by that configuration.
Currently, output produced by the driver is forwarded directly to GCS by the Dataproc agent. In the future there will be an option to have all driver output sent to Stackdriver when starting a cluster.
Update:
This feature is now in Beta and is stable to use. When creating a Cluster, the property "dataproc:dataproc.logging.stackdriver.job.driver.enable" can be used to toggle whether the cluster sends Job driver logs to Stackdriver. Additionally, you can use the property "dataproc:dataproc.logging.stackdriver.job.yarn.container.enable" to have the cluster associate YARN container logs with the Job they were created by instead of the Cluster they ran on.
Documentation is available here
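For reference, a sketch of setting those properties at cluster creation time with gcloud (the cluster name and region are placeholders):
gcloud dataproc clusters create <cluster name> \
  --region=<region> \
  --properties='dataproc:dataproc.logging.stackdriver.job.driver.enable=true,dataproc:dataproc.logging.stackdriver.job.yarn.container.enable=true'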

Recovering from Kubernetes node failure running Cassandra

I'm looking for a good solution for replacing a dead Kubernetes worker node that was running Cassandra in Kubernetes.
Scenario:
Cassandra cluster built from 3 pods
Failure occurs on one of the Kubernetes worker nodes
Replacement node is joining the cluster
New pod from StatefulSet is scheduled on new node
As the pod IP address has changed, the new pod is visible as a new Cassandra node (4 nodes in the cluster in total) and is unable to bootstrap until the dead one is removed.
It's very difficult to follow the official procedure, as Cassandra is running as a StatefulSet.
One completely hacky workaround I've found is to use a ConfigMap to supply JAVA_OPTS. Since changing a ConfigMap doesn't recreate pods (yet), you can manipulate the running pods in such a way that you can follow the procedure.
However, as I mentioned, that's super hacky. I'm wondering if anyone is running Cassandra on top of Kubernetes and has a better idea of how to deal with such a failure?
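For illustration, a rough sketch of what that workaround might look like, assuming the StatefulSet's containers read extra JVM flags from a ConfigMap (the ConfigMap name and key are hypothetical; -Dcassandra.replace_address_first_boot is Cassandra's standard dead-node replacement flag):
# point the replacement pod at the dead Cassandra node's old IP via the supplied JVM opts
kubectl patch configmap <cassandra jvm opts configmap> \
  -p '{"data":{"jvm-opts":"-Dcassandra.replace_address_first_boot=<dead node IP>"}}'
# recreate the stuck pod so it starts with the new flag and replaces the dead node
kubectl delete pod <stuck cassandra pod>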
Jetstack navigator supports this, but it's currently in alpha:
https://github.com/jetstack/navigator
unable to bootstrap until the dead one is removed.
Why is that?
I use a StatefulSet, and I'm able to kill a pod and have a new one join in.

StatefulSet: pods stuck in unknown state

I'm experimenting with Cassandra and Redis on Kubernetes, using the examples for v1.5.1.
With a Cassandra StatefulSet, if I shut down a node without draining or deleting it via kubectl, that node's Pod stays around forever (at least over a week, anyway) without being moved to another node.
With Redis, even though the pod sticks around like with Cassandra, the sentinel service starts a new pod, so the number of functional pods is always maintained.
Is there a way to automatically move the Cassandra pod to another node, if a node goes down? Or do I have to drain or delete the node manually?
Please refer to the documentation here.
Kubernetes (versions 1.5 or newer) will not delete Pods just because a Node is unreachable. The Pods running on an unreachable Node enter the ‘Terminating’ or ‘Unknown’ state after a timeout. Pods may also enter these states when the user attempts graceful deletion of a Pod on an unreachable Node. The only ways in which a Pod in such a state can be removed from the apiserver are as follows:
The Node object is deleted (either by you, or by the Node Controller).
The kubelet on the unresponsive Node starts responding, kills the Pod and removes the entry from the apiserver.
Force deletion of the Pod by the user.
This was a behavioral change introduced in Kubernetes 1.5 that allows StatefulSets to prioritize safety.
There is no way to differentiate between the following cases:
The instance being shut down without the Node object being deleted.
A network partition being introduced between the Node in question and the Kubernetes master.
Both cases look the same to the Kubernetes master: the kubelet on the Node is unresponsive. If, in the second case, we were to quickly create a replacement pod on a different Node, we might violate the at-most-one semantics guaranteed by StatefulSet and end up with multiple pods with the same identity running on different nodes. At worst, this could even lead to split brain and data loss when running stateful applications.
On most cloud providers, when an instance is deleted, Kubernetes can figure out that the Node has also been deleted, and hence lets the StatefulSet pod be recreated elsewhere.
However, if you're running on-prem, this may not happen. It is recommended that you delete the Node object from Kubernetes as you power it down, or have a reconciliation loop keeping Kubernetes' idea of Nodes in sync with the actual nodes available.
Some more context is in the GitHub issue.
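For reference, the manual recovery in that situation usually looks like this (a sketch; the pod and node names are placeholders):
# if the node is really gone, remove its Node object so the StatefulSet controller can act
kubectl delete node <dead node name>
# or force-delete the stuck pod yourself (only when you are sure it is no longer running anywhere)
kubectl delete pod <stuck pod name> --grace-period=0 --force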
