How can I upgrade an AKS cluster using Terraform without downtime - Azure

I want to upgrade my AKS cluster using Terraform with no or minimal downtime.
What happens to the workloads during the cluster upgrade?
Can I do the AKS cluster upgrade and node upgrade at the same time?
Azure provides Scheduled AKS cluster maintenance (a preview feature); does that mean Azure does the cluster upgrade?

You have several questions listed here, so I will answer them as best I can. They are generic and not specific to Terraform, so I will address Terraform separately at the bottom.
What happens to the workloads during the cluster upgrade?
During an upgrade, it depends on whether Azure is doing the upgrade or you are doing it manually. If Azure does the upgrade, it may be disruptive depending on the settings you chose when you created the cluster.
If you do the upgrade yourself, you can do it with no downtime, but it does require some Azure CLI usage due to how the azurerm_kubernetes_cluster Terraform resource is designed.
Can I do the AKS cluster upgrade and node upgrade at the same time?
Yes. If your nodes are out of date and you schedule a cluster upgrade, they will be brought up to date in the process of upgrading the cluster.
Azure provides Scheduled AKS cluster maintenance (a preview feature); does that mean Azure does the cluster upgrade?
No. A different setting determines if Azure does the upgrade. This Scheduled Maintenance feature is designed to allow you to specify what times and days Microsoft is NOT allowed to do maintenance. The default when you don't specify a Scheduled Maintenance is that Microsoft may perform upgrades at any time:
https://learn.microsoft.com/en-us/azure/aks/planned-maintenance
Your AKS cluster has regular maintenance performed on it automatically. By default, this work can happen at any time. Planned Maintenance allows you to schedule weekly maintenance windows that will update your control plane as well as your kube-system Pods on a VMSS instance and minimize workload impact
The feature you are looking for regarding AKS performing cluster upgrades is called Cluster Autoupgrade, and you can read about that here: https://learn.microsoft.com/en-us/azure/aks/upgrade-cluster#set-auto-upgrade-channel-preview
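As a hedged example of enabling it (the feature was in preview at the time, and flag names can change between CLI versions; resource group and cluster names are placeholders):
# Let AKS upgrade itself by subscribing the cluster to an auto-upgrade channel
az aks update --resource-group <ResourceGroup> --name <ClusterName> --auto-upgrade-channel stable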
Now in regards to performing a cluster upgrade with Terraform. Currently, due to how azurerm_kubernetes_cluster is designed, it is not possible to perform an upgrade of a cluster using only Terraform. Some azure-cli usage is required. It is possible to perform a cluster upgrade without downtime, but not possible by exclusively using Terraform. The steps to perform such an upgrade are detailed pretty well in this blog post: https://blog.gft.com/pl/2020/08/26/zero-downtime-migration-of-azure-kubernetes-clusters-managed-by-terraform/
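For reference, the flow in that post boils down to roughly the following Azure CLI steps. This is a sketch only; names are placeholders and exact flags vary by CLI and kubectl version:
# 1. Upgrade only the control plane, leaving the existing nodes untouched
az aks upgrade --resource-group <ResourceGroup> --name <ClusterName> --control-plane-only -k <NewVersion>
# 2. Add a new node pool, which comes up on the new version
az aks nodepool add --resource-group <ResourceGroup> --cluster-name <ClusterName> --name <NewPool> --node-count 3
# 3. Drain the old pool so workloads reschedule onto the new one
kubectl drain -l agentpool=<OldPool> --ignore-daemonsets --delete-emptydir-data
# 4. Delete the old pool, then reconcile the Terraform configuration and state
az aks nodepool delete --resource-group <ResourceGroup> --cluster-name <ClusterName> --name <OldPool>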

An AKS cluster uses the concept of a buffer node when an upgrade is performed: it brings up a buffer node, moves the workload to it, and upgrades the actual node. The time taken to upgrade the cluster depends on the number of nodes in it.
https://learn.microsoft.com/en-us/azure/aks/upgrade-cluster#upgrade-an-aks-cluster
You can upgrade the control plane as well as the node pools using the Azure CLI.
az aks upgrade --resource-group <ResourceGroup> --name <ClusterName> -k <KubernetesVersion>
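The number of buffer nodes used is controlled by the node pool's max surge setting; as a hedged example (the flag requires a reasonably recent CLI, and names are placeholders):
# Use up to 33% extra buffer nodes during an upgrade
az aks nodepool update --resource-group <ResourceGroup> --cluster-name <ClusterName> --name <NodePoolName> --max-surge 33%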

Related

Vertical scaling of an Azure Kubernetes cluster

I am unable to scale my AKS cluster vertically.
Currently, I have 3 nodes in my cluster with 2 cores and 8 GB RAM each, and I am trying to upgrade them to 16 cores and 64 GB RAM. How do I do it?
I tried scaling the VM scale set; the Azure portal shows it as scaled, but when I run "kubectl get nodes -o wide" it still shows the old size.
Any leads will be helpful.
Thanks,
Abhishek
Vertical scaling, i.e. changing the node pool VM size in place, is not supported. You need to create a new node pool and schedule your pods onto the new nodes; a sketch of the CLI flow follows the quote below.
https://github.com/Azure/AKS/issues/1556#issuecomment-615390245
This UX issue is due to how the VMSS is managed by AKS. Since AKS is a managed service, we don't support operations done outside of the AKS API to the infrastructure resources. In this example you are using the VMSS portal to resize, which uses VMSS APIs to resize the resource and as a result has unexpected changes.
AKS node pools don't support resize in place, so the supported way to do this is to create a new node pool with a new target and delete the previous one. This needs to be done through the AKS portal UX. This maintains the goal state of the AKS node pool, as at the moment the portal is showing the VM size AKS knows you have, because that is what was originally requested.
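A minimal sketch of that replacement flow with the Azure CLI (names are placeholders; Standard_D16s_v3 is one example of a 16-core/64 GB size, pick whatever fits your workload):
# Add a new pool with the larger VM size
az aks nodepool add --resource-group <ResourceGroup> --cluster-name <ClusterName> --name <NewPool> --node-count 3 --node-vm-size Standard_D16s_v3
# Drain the old nodes so pods move to the new pool
kubectl drain -l agentpool=<OldPool> --ignore-daemonsets --delete-emptydir-data
# Remove the old pool once workloads have rescheduled
az aks nodepool delete --resource-group <ResourceGroup> --cluster-name <ClusterName> --name <OldPool>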

What can you not do when you use AKS instead of a self-managed Kubernetes cluster?

I'm deciding whether I should use vanilla Kubernetes or Azure Kubernetes Service for my CI build agents.
What control will I lose if I use AKS? SSH inside the cluster? Turning the VMs on and off? How about the cost; I see that AKS uses VM pricing, is there something beyond that?
There are several limitations which come to my mind, but neither of them should restrict your use case:
You lose control over master nodes (the control plane). That shouldn't be an issue in your use case, and I can hardly imagine a case where this is a limitation. You can still SSH into worker nodes in AKS.
You lose fine-grained control over size of worker nodes. Node pools become an abstraction to control size of the VMs. In a self-managed cluster you can attach VMs of completely different size to the cluster. In AKS all the nodes in the same pool must be of the same size (but you can create several node pools with different VM sizes).
It's not possible to choose the node OS in AKS (it's Ubuntu-based).
You're not flexible in choosing network plugins for k8s. It's either kubenet or Azure CNI, but that's fine as long as you're not using some unusual applications which require L2 networking, more info here
There are definitely benefits of AKS:
You're not managing the control plane, which is a real pain reliever.
AKS can scale its nodes dynamically, which may be a good option for bursty workloads like build agents, though it imposes additional delay during the node scaling procedure (see the sketch after this list).
Cluster (control and data plane) upgrades are just a couple of clicks in the Azure portal.
The control plane is free in AKS (in contrast to e.g. EKS in Amazon); you pay only for the worker nodes. You can calculate your price here
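As a hedged illustration of the dynamic node scaling mentioned in the second point (placeholder names; exact flags depend on the CLI version):
# Enable the cluster autoscaler so the node count follows build-agent demand
az aks update --resource-group <ResourceGroup> --name <ClusterName> --enable-cluster-autoscaler --min-count 1 --max-count 5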

Does Azure Automation DSC respect Service Fabric update domains?

I have a Service Fabric cluster hosted in Microsoft Azure, and I have configured its scale set to register all nodes with Azure Automation DSC (following the example from https://learn.microsoft.com/en-us/azure/virtual-machines/extensions/dsc-template#template-example-for-windows-virtual-machine-scale-sets).
I now need to update the DSC script to also ensure that TLS 1.0 is disabled. This registry change requires a reboot of the affected machines. How can I get DSC to apply this change one update domain at a time so that all the VMs in my cluster aren't rebooted at the same time?
This depends on the durability level that you have configured for your cluster:
Gold - Restarts can be delayed until approved by the Service Fabric cluster. Updates can be paused for 2 hours per UD to allow additional time for replicas to recover from earlier failures.
Silver - Restarts can be delayed until approved by the Service Fabric cluster. Updates cannot be delayed for any significant period of time.
Bronze - Restarts will not be delayed by the Service Fabric cluster. Updates cannot be delayed for any significant period of time.
So, you'll need your cluster to have either Silver or Gold level.
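If your cluster is on Bronze today, the durability level can be raised per node type. A sketch with the Azure CLI (parameter names may differ slightly between CLI versions; names are placeholders):
# Raise durability so Service Fabric can delay reboots until it approves them
az sf cluster durability update --resource-group <ResourceGroup> --cluster-name <ClusterName> --node-type <NodeTypeName> --durability-level Silver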

How to make a HDInsight/Spark cluster shrink when idle?

We use Spark 2.2 on Azure HDInsight for ad hoc exploration and batch jobs.
The jobs should run ok on a 5x medium VM cluster. They are
1. notebooks (Zeppelin with Livy.spark2 magics)
2. compiled jars being run with Livy.
I have to remember to scale this cluster down to 1 worker when not using it, to save money. (0 workers would be nice, if that were possible).
I'd like Spark to manage this for me... When a Job starts, scale the cluster up to a minimum size first, then pause ~10 mins while that completes. After an idle period without Jobs, scale down again.
You can use PowerShell or the Azure classic CLI to scale the cluster up or down, but you might need to write a script that tracks cluster resource usage and scales down automatically (a sketch follows below).
Here is the PowerShell syntax:
Set-AzureRmHDInsightClusterSize -ClusterName <Cluster Name> -TargetInstanceCount <NewSize>
Here is a PowerShell workflow runbook that will help you automate the process of scaling your HDInsight clusters in or out depending on your needs:
https://gallery.technet.microsoft.com/scriptcenter/Scale-your-HDInsight-f57bb4d8
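A minimal sketch of such an idle-check script in Bash, assuming the cluster's Livy endpoint at the standard https://<ClusterName>.azurehdinsight.net/livy URL, the cluster admin login, and jq installed; the actual scale-down would be the PowerShell cmdlet above:
# Count Livy batch jobs that are still starting or running
ACTIVE=$(curl -s -u admin:<Password> "https://<ClusterName>.azurehdinsight.net/livy/batches" \
  | jq '[.sessions[] | select(.state == "starting" or .state == "running")] | length')
# If nothing is running, it is safe to shrink the cluster to one worker
if [ "$ACTIVE" -eq 0 ]; then
  echo "Cluster idle - run Set-AzureRmHDInsightClusterSize with -TargetInstanceCount 1"
fi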
or
You can use the option below to scale manually (even though your question is about scaling up/down automatically, I thought it would be useful to someone who wants to scale manually).
Here is a link to an article explaining different methods to scale the cluster using PowerShell or the classic CLI (remember: the latest CLI doesn't support the scaling feature):
https://learn.microsoft.com/en-us/azure/hdinsight/hdinsight-scaling-best-practices
If you want Spark to handle it dynamically, Azure Databricks is the best choice (but it is a Spark-only cluster, with no Hadoop components except Hive). HDInsight Spark clusters do not scale themselves automatically, so HDInsight alone will not solve your use case.
When creating a new cluster in Azure Databricks, there is an "enable autoscaling" option which allows the cluster to scale dynamically while a job is executed.

Who manages the nodes in an AKS cluster?

I started using the AKS service with a 3 node setup. As I was curious, I peeked at the provisioned VMs which are used as nodes. I noticed I can get root on these and that some updates need to be installed. As I couldn't find anything in the docs, my question is: who is in charge of managing the AKS nodes (VMs)?
Do I have to do this myself, or what is the idea here?
Thank you in advance.
Azure automatically applies security patches to the nodes in your cluster on a nightly schedule. However, you are responsible for ensuring that nodes are rebooted as required.
You have several options for performing node reboots:
Manually, through the Azure portal or the Azure CLI.
By upgrading your AKS cluster. Cluster upgrades automatically cordon and drain nodes, then bring them back up with the latest Ubuntu image. You can update the OS image on your nodes without changing Kubernetes versions by specifying the current cluster version in az aks upgrade (see the sketch below).
Using Kured, an open-source reboot daemon for Kubernetes. Kured runs as a DaemonSet and monitors each node for the presence of a file indicating that a reboot is required. It then manages OS reboots across the cluster, following the same cordon and drain process described earlier.
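A hedged sketch of that second option (placeholder names; newer CLI versions also offer az aks upgrade --node-image-only for exactly this):
# Re-image the nodes without changing Kubernetes versions by "upgrading" to the current version
CURRENT=$(az aks show --resource-group <ResourceGroup> --name <ClusterName> --query kubernetesVersion -o tsv)
az aks upgrade --resource-group <ResourceGroup> --name <ClusterName> --kubernetes-version "$CURRENT"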
