Need to upgrade AKS version from 1.14.8 to 1.15.10. Not sure if the Nodes will reboot with this or not - azure

Need to upgrade the AKS version from 1.14.8 to 1.15.10. I am not sure whether the nodes will reboot with this or not.
Could anyone please clarify this for me?

If you are using a higher-level controller such as a Deployment and running multiple replicas of the pod, you will not have downtime in your application, because Kubernetes spreads the replicas across different nodes; when a particular node is cordoned and drained for an upgrade or maintenance, the other replicas keep running on the remaining nodes.
If you run a bare pod directly, you will have downtime in your application while the upgrade is happening.
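As a rough illustration (a minimal sketch only; the my-api name, image and numbers are placeholders, not from the question), a Deployment with several replicas plus a PodDisruptionBudget keeps part of the application up while nodes are drained during the upgrade:

kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-api                      # placeholder name
spec:
  replicas: 3                       # several replicas, spread across nodes by the scheduler
  selector:
    matchLabels:
      app: my-api
  template:
    metadata:
      labels:
        app: my-api
    spec:
      containers:
      - name: my-api
        image: nginx:1.21           # placeholder image
---
apiVersion: policy/v1beta1          # policy/v1 on Kubernetes 1.21+
kind: PodDisruptionBudget
metadata:
  name: my-api-pdb
spec:
  minAvailable: 2                   # a drain must leave at least 2 replicas running
  selector:
    matchLabels:
      app: my-api
EOF

The PodDisruptionBudget is optional, but it makes the drain wait until enough replicas are running elsewhere.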

Reading the documentation, we can find:
During the upgrade process, AKS adds a new node to the cluster that runs the specified Kubernetes version, then carefully cordons and drains one of the old nodes to minimize disruption to running applications. When the new node is confirmed as running application pods, the old node is deleted.
They will not be rebooted, only replaced with new ones.
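Roughly what that looks like from the Azure CLI (resource group and cluster name below are placeholders): first list the versions you can move to, then start the upgrade of the control plane and node pools:

az aks get-upgrades --resource-group myResourceGroup --name myAKSCluster --output table   # versions available from the current one
az aks upgrade --resource-group myResourceGroup --name myAKSCluster --kubernetes-version 1.15.10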

By default, when you upgrade, AKS temporarily increases the node capacity: one extra node is spun up with the Kubernetes version you are upgrading to.
Then, using a rolling strategy, it upgrades the nodes one by one.
It moves all the pods from an old node onto the extra node and deletes the old node. This cycle continues until all nodes are running the new version.
If we have a ReplicaSet or Deployment, then ideally there should be no downtime.
We can also use podAntiAffinity so that no two replicas of the same pod end up on the same node, which further reduces the risk of downtime; a sketch follows below.
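A minimal sketch of that podAntiAffinity idea (placeholder names and image; adapt the labels to your own manifests): the scheduler is told never to put two replicas with the same app label on the same node.

kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-api
  template:
    metadata:
      labels:
        app: my-api
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:   # hard rule; use preferredDuringScheduling... for a softer spread
          - labelSelector:
              matchLabels:
                app: my-api
            topologyKey: kubernetes.io/hostname             # "same node" = same hostname label
      containers:
      - name: my-api
        image: nginx:1.21                                   # placeholder image
EOF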

Related

Upgrade virtual-node-aci-linux in Azure Kubernetes Cluster

Does anyone have any links or know how to upgrade the virtual-node-aci-linux on Azure?
I am currently on version v1.19.10-vk-azure-aci-v1.4.1, however my other node pools are now on v1.22.11 after upgrading Kubernetes.
I am getting some odd behaviour since the upgrade: it seems I now have to keep a single instance in my VMSS for the virtual-node-aci-linux node to be ready. I don't remember having to do this before.
NAME STATUS ROLES AGE VERSION
aks-control-13294507-vmss000006 Ready agent 86s v1.22.11
virtual-node-aci-linux Ready agent 164d v1.19.10-vk-azure-aci-v1.4.1
Also, I am fairly sure that previously only my virtual-node-aci-linux node was visible in the node list.
Any help would be appreciated.

How to scale up Kubernetes cluster with Terraform avoiding downtime?

Here's the scenario: we have some applications running on a Kubernetes cluster on Azure. Currently our production cluster has one Nodepool with 3 nodes which are fairly low on resources because we still don't have that many active users/requests simultaneously.
Our backend API app is running on three pods, one on each node. I was told I will need to increase resources soon (I'm thinking more memory, or even replacing the nodes' VMs with better ones).
We structured everything Kubernetes-related using Terraform, and I know that replacing the VMs of a node pool is a destructive action, meaning the cluster will have to be replaced, reconfigured, and all deployments, services, etc. will have to be reapplied.
I am fairly new to the Kubernetes and Terraform world, meaning I can do the basics to get an application up and running, but I would like to learn the best practice when it comes to scaling and performance. How can I perform such an increase in resources without any downtime of our services?
I'm wondering if having an extra node pool would help while I replace the VMs of the other one (I might be absolutely wrong here).
If there's any link, course, tutorial you can point me to it's highly appreciated.
(Moved from comments)
In Azure, when you're performing a cluster upgrade, there's a parameter called "max surge count", which is equal to 1 by default. What it means is that when you update your cluster or node configuration, AKS will first create one extra node with the updated configuration, and only then safely drain and remove one of the old ones. More on this here: Azure - Node Surge Upgrade
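If you want more headroom than the default single surge node, the surge value can be raised per node pool; a sketch assuming the az CLI and placeholder resource names:

az aks nodepool update \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name nodepool1 \
  --max-surge 33%      # surge up to a third of the pool at once during upgrades (default is 1 node)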

Azure kubernetes service node pool upgrades & patches

I have some confusion about AKS node pool upgrades and patching. Could you please clarify the following?
I have one AKS node pool with 4 nodes, and I want to upgrade the Kubernetes version on only two of those nodes. Is that possible?
If it is possible to upgrade only two nodes, how can we upgrade the remaining two later? And how can we find out which two nodes are still on the old Kubernetes version rather than the latest one?
During the upgrade process, will it create two new nodes with the latest Kubernetes version and then delete the old nodes in the node pool?
Azure automatically applies patches to nodes, but will it create new nodes with the new patches and delete the old ones?
1. According to the docs, you can upgrade a specific node pool, so the approach with an additional node pool mentioned by 4c74356b41 also applies here; see the sketch below.
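A sketch of upgrading just one node pool (placeholder resource names); this is the granularity AKS offers, i.e. a whole pool rather than individual nodes:

az aks nodepool upgrade \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name nodepool1 \
  --kubernetes-version <target-version>   # e.g. the version the control plane already runs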
Additional info:
Node upgrades
There is an additional process in AKS that lets you upgrade a cluster. An upgrade is typically to move to a newer version of Kubernetes, not just apply node security updates.
An AKS upgrade performs the following actions:
A new node is deployed with the latest security updates and Kubernetes version applied.
An old node is cordoned and drained.
Pods are scheduled on the new node.
The old node is deleted.
2. By default, AKS uses one additional (surge) node when performing upgrades.
You can control this behaviour by increasing the --max-surge parameter.
To speed up the node image upgrade process, you can upgrade your node images using a customizable node surge value.
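A hedged example of that node image upgrade (placeholder names); it refreshes only the node OS image of a pool and reuses whatever --max-surge value is configured on it:

az aks nodepool upgrade \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name nodepool1 \
  --node-image-only    # keep the Kubernetes version, only roll out the latest node image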
3. Security and kernel updates to Linux nodes:
In an AKS cluster, your Kubernetes nodes run as Azure virtual machines (VMs). These Linux-based VMs use an Ubuntu image, with the OS configured to automatically check for updates every night. If security or kernel updates are available, they are automatically downloaded and installed.
Some security updates, such as kernel updates, require a node reboot to finalize the process. A Linux node that requires a reboot creates a file named /var/run/reboot-required. This reboot process doesn't happen automatically.
This tutorial summarizes the process in Cluster Maintenance and Other Tasks.
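One hedged way to check for that marker file without SSH-ing into a node (assumes a kubectl recent enough to support node debugging; the node name is a placeholder):

kubectl debug node/aks-nodepool1-12345678-vmss000000 -it --image=busybox -- ls /host/var/run/reboot-required
# the node's filesystem is mounted under /host in the debug pod;
# if the file exists the node still needs a reboot, which tools such as kured can automate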
No, you cannot upgrade only some nodes. Create another pool with 2 nodes and test your application there, or create another cluster (see the sketch after these answers). You can find node versions with kubectl get nodes.
It gradually updates nodes one by one (the default); you can change this behaviour. Spot instances cannot be upgraded.
Yes, the latest patch version image will be used.
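A hedged sketch of the test-pool approach from the first answer (placeholder names; the pool version cannot be newer than the control plane version):

az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name testpool \
  --node-count 2 \
  --kubernetes-version <target-version>

kubectl get nodes    # the VERSION column shows which nodes run which Kubernetes version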

Recovering from Kubernetes node failure running Cassandra

I'm looking for a good solution to replace dead Kubernetes worker node that was running Cassandra in Kubernetes.
Scenario:
Cassandra cluster built from 3 pods
Failure occurs on one of the Kubernetes worker nodes
Replacement node is joining the cluster
New pod from StatefulSet is scheduled on new node
As the pod IP address has changed, the new pod is visible as a new Cassandra node (4 nodes in the cluster in total) and is unable to bootstrap until the dead one is removed.
It's very difficult to follow the official procedure, as Cassandra is running as a StatefulSet.
One completely hacky workaround I've found is to use a ConfigMap to supply JAVA_OPTS; a rough sketch is below. As changing a ConfigMap doesn't recreate pods (yet), you can manipulate running pods in such a way that you are able to follow the procedure.
However, that is, as I mentioned, super hacky. I'm wondering if anyone is running Cassandra on top of Kubernetes and has a better idea of how to deal with such a failure?
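Very roughly, and heavily hedged (the JVM_EXTRA_OPTS variable name is an assumption that depends on your Cassandra image, and the IP is a placeholder), the workaround looks like a ConfigMap carrying the replace-address flag that only a restarting pod picks up:

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: cassandra-jvm-opts
data:
  # hypothetical variable: your Cassandra entrypoint must append it to the JVM options
  JVM_EXTRA_OPTS: "-Dcassandra.replace_address_first_boot=10.244.1.23"   # IP of the dead Cassandra node (placeholder)
EOF
# reference this ConfigMap from the StatefulSet pod template (e.g. via envFrom), then delete the
# stuck pod so that, on its next start on the new worker, it bootstraps as a replacement node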
Jetstack navigator supports this, but it's currently in alpha:
https://github.com/jetstack/navigator
unable to bootstrap until the dead one is removed.
Why is that?
I use the statefulset and I'm able to kill a pod and have a new one join in

StatefulSet: pods stuck in unknown state

I'm experimenting with Cassandra and Redis on Kubernetes, using the examples for v1.5.1.
With a Cassandra StatefulSet, if I shutdown a node without draining or deleting it via kubectl, that node's Pod stays around forever (at least over a week, anyway), without being moved to another node.
With Redis, even though the pod sticks around like with Cassandra, the sentinel service starts a new pod, so the number of functional pods is always maintained.
Is there a way to automatically move the Cassandra pod to another node, if a node goes down? Or do I have to drain or delete the node manually?
Please refer to the documentation here.
Kubernetes (versions 1.5 or newer) will not delete Pods just because a Node is unreachable. The Pods running on an unreachable Node enter the ‘Terminating’ or ‘Unknown’ state after a timeout. Pods may also enter these states when the user attempts graceful deletion of a Pod on an unreachable Node. The only ways in which a Pod in such a state can be removed from the apiserver are as follows:
The Node object is deleted (either by you, or by the Node Controller).
The kubelet on the unresponsive Node starts responding, kills the Pod and removes the entry from the apiserver.
Force deletion of the Pod by the user.
This was a behavioral change introduced in kubernetes 1.5, which allows StatefulSet to prioritize safety.
There is no way to differentiate between the following cases:
The instance being shut down without the Node object being deleted.
A network partition is introduced between the Node in question and the kubernetes-master.
Both these cases are seen as the kubelet on a Node being unresponsive by the Kubernetes master. If in the second case, we were to quickly create a replacement pod on a different Node, we may violate the at-most-one semantics guaranteed by StatefulSet, and have multiple pods with the same identity running on different nodes. At worst, this could even lead to split brain and data loss when running Stateful applications.
On most cloud providers, when an instance is deleted, Kubernetes can figure out that the Node is also deleted, and hence let the StatefulSet pod be recreated elsewhere.
However, if you're running on-prem, this may not happen. It is recommended that you delete the Node object from Kubernetes as you power it down, or have a reconciliation loop keeping the Kubernetes idea of Nodes in sync with the actual nodes available.
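In practice, once the worker is known to be gone for good, the manual steps look like this (node and pod names are placeholders):

kubectl delete node worker-2                                # tell Kubernetes the node is gone for good
kubectl delete pod cassandra-1 --grace-period=0 --force     # only if you are certain the old pod is not still running somewhere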
Some more context is in the github issue.
