Recovering from Kubernetes node failure running Cassandra

I'm looking for a good way to replace a dead Kubernetes worker node that was running Cassandra.
Scenario:
A Cassandra cluster is built from 3 pods.
A failure occurs on one of the Kubernetes worker nodes.
A replacement node joins the cluster.
A new pod from the StatefulSet is scheduled on the new node.
Because the pod IP address has changed, the new pod shows up as a new Cassandra node (4 nodes in the cluster in total) and is unable to bootstrap until the dead one is removed.
It's very difficult to follow the official procedure, as Cassandra is running as a StatefulSet.
One completely hacky workaround I've found is to use a ConfigMap to supply JAVA_OPTS. As changing a ConfigMap doesn't recreate pods (yet), you can manipulate running pods in such a way that you are able to follow the procedure.
However, as I mentioned, that's super hacky. Is anyone running Cassandra on top of Kubernetes who has a better idea of how to deal with such a failure?
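
For reference, the "official procedure" mentioned above is presumably Cassandra's dead-node replacement: either start the replacement node with the cassandra.replace_address_first_boot system property set to the dead node's old IP, or remove the dead node's host ID with nodetool removenode from a live node. Below is a minimal sketch of both options as small Python helpers. The pod name, IP address and host ID are hypothetical placeholders, and how the JVM flag reaches the container (the JAVA_OPTS ConfigMap here) depends on the image in use.

```python
import ipaddress


def replace_address_opt(dead_node_ip: str) -> str:
    """Return the JVM flag asking a fresh node to take over a dead node's place."""
    # Validate early so a typo does not end up in the pod's environment.
    ipaddress.ip_address(dead_node_ip)
    return f"-Dcassandra.replace_address_first_boot={dead_node_ip}"


def removenode_command(live_pod: str, dead_host_id: str) -> list[str]:
    """Build the kubectl call that runs `nodetool removenode` inside a live pod."""
    return ["kubectl", "exec", live_pod, "--", "nodetool", "removenode", dead_host_id]


if __name__ == "__main__":
    # Hypothetical values: the old pod IP and a surviving pod's name.
    print(replace_address_opt("10.244.1.17"))
    print(" ".join(removenode_command("cassandra-0", "<host-id-from-nodetool-status>")))
```

The host ID of the dead node is the one that nodetool status lists in the DN (down/normal) state.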

Jetstack navigator supports this, but it's currently in alpha:
https://github.com/jetstack/navigator

"unable to bootstrap until the dead one is removed."
Why is that?
I use a StatefulSet and I'm able to kill a pod and have a new one join the cluster.

Related

How to scale up Kubernetes cluster with Terraform avoiding downtime?

Here's the scenario: we have some applications running on a Kubernetes cluster on Azure. Currently our production cluster has one node pool with 3 nodes, which are fairly low on resources because we don't yet have many simultaneous active users or requests.
Our backend API app is running on three pods, one on each node. I was told I will need to increase resources soon (I'm thinking more memory, or even replacing the nodes' VMs with better ones).
We manage everything Kubernetes-related with Terraform, and I know that replacing the VMs of the nodes is a destructive action, meaning the cluster will have to be replaced and the new config and all deployments, services, etc. will have to be reapplied.
I am fairly new to the Kubernetes and Terraform world: I can do the basics to get an application up and running, but I would like to learn the best practices for scaling and performance. How can I increase resources like this without any downtime for our services?
I'm wondering if having an extra node pool would help while I replace the VMs of the other one (I might be absolutely wrong here).
If there's any link, course, or tutorial you can point me to, it's highly appreciated.
(Moved from comments)
In Azure, when you perform a cluster upgrade, there's a parameter called "max surge count", which equals 1 by default. It means that when you update your cluster or node configuration, AKS first creates one extra node with the updated configuration, and only then safely drains and removes one of the old ones. More on this here: Azure - Node Surge Upgrade
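
To make the ordering concrete, here is a toy Python model of the surge behaviour described above. It is only a sketch of the description (add updated nodes first, then remove the same number of old ones), not AKS's actual implementation, and the function name is made up for illustration.

```python
def surge_upgrade(node_count: int, max_surge: int = 1) -> list[tuple[int, int]]:
    """Simulate a surge upgrade; return (old_nodes, new_nodes) after every step."""
    if node_count < 1 or max_surge < 1:
        raise ValueError("node_count and max_surge must be positive")
    old, new = node_count, 0
    history = [(old, new)]
    while old > 0:
        batch = min(max_surge, old)
        new += batch  # surge: bring up updated nodes first
        history.append((old, new))
        old -= batch  # then drain and remove the same number of old nodes
        history.append((old, new))
    return history


if __name__ == "__main__":
    for old, new in surge_upgrade(3):
        print(f"old={old} new={new} total={old + new}")
```

In this model the total number of nodes never drops below the original count, which is why a workload with several replicas can keep serving during the upgrade.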

Need to upgrade AKS version from 1.14.8 to 1.15.10. Not sure if the Nodes will reboot with this or not

Could anyone please clear this up?
If you are using a higher-level controller such as a Deployment and running multiple replicas of the pod, you should not have downtime in your application: the scheduler normally spreads replicas across different nodes (though this is not strictly guaranteed without anti-affinity rules), so when a particular node is cordoned and drained for upgrade or maintenance, other replicas of the pod are still running on other nodes.
If you run pods directly, without a controller, your application will have downtime while the upgrade is happening.
The documentation says:
During the upgrade process, AKS adds a new node to the cluster that runs the specified Kubernetes version, then carefully cordon and drains one of the old nodes to minimize disruption to running applications. When the new node is confirmed as running application pods, the old node is deleted.
They will not be rebooted, only replaced with new ones.
By default, AKS upgrades nodes by temporarily increasing the node capacity: one extra node is spun up with the Kubernetes version you are upgrading to.
Then it upgrades the nodes one by one using a rolling strategy.
It drains the pods from an old node, reschedules them (for example onto the new extra node), and deletes the old node. This cycle continues until all nodes are running the new version.
If the application runs as a ReplicaSet or Deployment, there should ideally be no downtime.
We can also use podAntiAffinity so that no two pods land on the same node, which helps avoid downtime.
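
As an illustration of the podAntiAffinity suggestion, the sketch below builds the affinity stanza of a pod template as a Python dict and prints it as JSON (kubectl accepts JSON manifests as well as YAML). The "app" label key and the "backend-api" value are assumptions; use whatever labels your pods actually carry.

```python
import json


def anti_affinity(app_label: str) -> dict:
    """Affinity stanza that forbids two pods with the same app label on one node."""
    return {
        "podAntiAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    # Select the other replicas of the same application.
                    "labelSelector": {"matchLabels": {"app": app_label}},
                    # One matching pod per distinct node hostname.
                    "topologyKey": "kubernetes.io/hostname",
                }
            ]
        }
    }


if __name__ == "__main__":
    # This goes under spec.template.spec of a Deployment or StatefulSet.
    print(json.dumps({"affinity": anti_affinity("backend-api")}, indent=2))
```

With the "required" form, a replica stays Pending if no node is free of matching pods, so it needs at least as many schedulable nodes as replicas.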

Spark on Kubernetes: Is it possible to keep the crashed pods when a job fails?

I have the strange problem that a Spark job run on Kubernetes fails with a lot of "Missing an output location for shuffle X" errors in jobs where there is a lot of shuffling going on. Increasing executor memory does not help. The same job run on just a single node of the Kubernetes cluster in local[*] mode runs fine, however, so I suspect it has to do with Kubernetes or the underlying Docker.
When an executor dies, the pods are deleted immediately so I cannot track down why it failed. Is there an option that keeps failed pods around so I can view their logs?
While the pod object still exists, you can view the logs of its previous (terminated) container instance like this:
kubectl logs -p <terminated pod name>
Also consider the spec.ttlSecondsAfterFinished field of a Job, as mentioned here.
Executors are deleted by default on any failure, and you cannot change that unless you customize the Spark on K8s code or use some advanced K8s tooling.
What you can do (and it is probably the easiest approach to start with) is configure an external log collector, e.g. Grafana Loki, which can be deployed with one click to any K8s cluster, or some ELK stack components. These will persist logs even after pods are deleted.
There is a deleteOnTermination setting in the spark application yaml. See the spark-on-kubernetes README.md.
deleteOnTermination - (Optional)
DeleteOnTermination specifies whether executor pods should be deleted in case of failure or normal termination. Maps to spark.kubernetes.executor.deleteOnTermination, which is available since Spark 3.0.
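
If you submit with plain spark-submit rather than through the operator's YAML, the same property can be passed as a --conf. A minimal sketch, assuming Spark 3.0 or newer; the master URL and application path below are placeholders, not values from the question.

```python
def spark_submit_args(master_url: str, app: str, keep_executors: bool = True) -> list[str]:
    """Build a spark-submit command line that keeps (or deletes) finished executor pods."""
    delete = "false" if keep_executors else "true"
    return [
        "spark-submit",
        "--master", master_url,
        "--deploy-mode", "cluster",
        "--conf", f"spark.kubernetes.executor.deleteOnTermination={delete}",
        app,
    ]


if __name__ == "__main__":
    # Placeholder API server address and application jar.
    args = spark_submit_args("k8s://https://<api-server>:443", "local:///opt/app/job.jar")
    print(" ".join(args))
```

With the property set to false, failed executor pods should stay around so that kubectl logs and kubectl describe still work on them; remember to clean them up afterwards.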

Troubleshooting kubernetes removed pod

I have a problem with a Spark application on Kubernetes. The Spark driver tries to create an executor pod, and the executor pod fails to start. The problem is that as soon as the pod fails, the Spark driver removes it and creates a new one, which fails for the same reason. So how can I recover logs from already-removed pods, given that this seems to be the default Spark behavior on Kubernetes? I am also not able to catch the pods, since the removal is instantaneous. I wonder how I am ever supposed to fix the failing pod if I cannot recover the errors.
In your case it would be helpful to implement cluster logging. Even if the pod gets restarted or deleted, its logs will stay in the log aggregator's storage.
There is more than one solution for cluster logging, but the most popular is EFK (Elasticsearch, Fluentd, Kibana).
Actually, you can even go without Elasticsearch and Kibana.
Check out the excellent article "Application Logging in Kubernetes with fluentd" by Rosemary Wang, which explains how to configure fluentd to write aggregated logs to the fluentd pod's stdout so you can access them later using the command:
kubectl logs <fluentd pod>…

StatefulSet: pods stuck in unknown state

I'm experimenting with Cassandra and Redis on Kubernetes, using the examples for v1.5.1.
With a Cassandra StatefulSet, if I shut down a node without draining or deleting it via kubectl, that node's Pod stays around forever (at least over a week, anyway), without being moved to another node.
With Redis, even though the pod sticks around like with Cassandra, the sentinel service starts a new pod, so the number of functional pods is always maintained.
Is there a way to automatically move the Cassandra pod to another node, if a node goes down? Or do I have to drain or delete the node manually?
Please refer to the documentation here.
Kubernetes (versions 1.5 or newer) will not delete Pods just because a Node is unreachable. The Pods running on an unreachable Node enter the ‘Terminating’ or ‘Unknown’ state after a timeout. Pods may also enter these states when the user attempts graceful deletion of a Pod on an unreachable Node. The only ways in which a Pod in such a state can be removed from the apiserver are as follows:
The Node object is deleted (either by you, or by the Node Controller).
The kubelet on the unresponsive Node starts responding, kills the Pod and removes the entry from the apiserver.
Force deletion of the Pod by the user.
This was a behavioral change introduced in Kubernetes 1.5, which allows StatefulSet to prioritize safety.
There is no way to differentiate between the following cases:
The instance being shut down without the Node object being deleted.
A network partition is introduced between the Node in question and the kubernetes-master.
Both these cases are seen as the kubelet on a Node being unresponsive by the Kubernetes master. If in the second case, we were to quickly create a replacement pod on a different Node, we may violate the at-most-one semantics guaranteed by StatefulSet, and have multiple pods with the same identity running on different nodes. At worst, this could even lead to split brain and data loss when running Stateful applications.
On most cloud providers, when an instance is deleted, Kubernetes can figure out that the Node is also deleted, and hence lets the StatefulSet pod be recreated elsewhere.
However, if you're running on-prem, this may not happen. It is recommended that you delete the Node object from Kubernetes as you power it down, or have a reconciliation loop keeping Kubernetes' idea of Nodes in sync with the actual nodes available.
Some more context is in the GitHub issue.
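
The core of such a reconciliation loop can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two input sets would come from kubectl get nodes and from your own inventory of powered-on machines, and the node and pod names are hypothetical. It only builds the commands; it does not run them.

```python
def stale_nodes(registered: set[str], powered_on: set[str]) -> list[str]:
    """Nodes Kubernetes still knows about that have no machine behind them."""
    return sorted(registered - powered_on)


def cleanup_commands(registered: set[str], powered_on: set[str]) -> list[list[str]]:
    """One `kubectl delete node` invocation per stale Node object."""
    return [["kubectl", "delete", "node", name] for name in stale_nodes(registered, powered_on)]


def force_delete_pod(pod: str) -> list[str]:
    """Manual alternative: force-delete a single pod stuck in Unknown/Terminating."""
    return ["kubectl", "delete", "pod", pod, "--grace-period=0", "--force"]


if __name__ == "__main__":
    registered = {"worker-1", "worker-2", "worker-3"}  # hypothetical: from the API server
    powered_on = {"worker-1", "worker-3"}              # hypothetical: from your inventory
    for cmd in cleanup_commands(registered, powered_on):
        print(" ".join(cmd))
    print(" ".join(force_delete_pod("cassandra-1")))
```

Only treat a machine as gone when you are sure it is really powered off. If it is merely partitioned from the master, deleting its Node object or force-deleting its pod is exactly the case that can lead to two pods with the same identity.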

Resources