I have a case where I want to perform an in-place upgrade of the vmSize of an AKS cluster's node pools; deleting the full cluster is not an option.
One alternative I have looked into is to run `az aks nodepool delete` and then recreate the pool with a new vmSize.
The question here is: what is really happening under the hood, drain all nodes and then delete?
Should we first drain all the nodes in sequence and then run the command, to achieve zero downtime? We are running multiple node pools.
Any other suggestions?
Why don't you add a new node pool with the desired vmSize, migrate your workload to the new node pool, and then delete the old node pool?
You could also import this new node pool into Terraform if you use it.
Or is it the system node pool you are talking about?
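A rough sketch of that approach with the Azure CLI and kubectl (the resource group, cluster, pool names, and VM size below are placeholders, not taken from the question):
# Add a replacement pool with the new VM size
az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name newpool \
  --node-count 3 \
  --node-vm-size Standard_D4s_v3 \
  --mode User

# Cordon and drain the old pool's nodes so workloads reschedule onto the new pool
for node in $(kubectl get nodes -l agentpool=nodepool2 -o jsonpath='{.items[*].metadata.name}'); do
  kubectl cordon "$node"
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
done

# Remove the old pool once it is empty
az aks nodepool delete \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name nodepool2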
Yesterday I was using kubectl on my command line and was getting this message after trying any command. Everything was working fine the previous day and I had not touched anything in my AKS.
Unable to connect to the server: x509: certificate has expired or is not yet valid: current time 2022-01-11T12:57:51-05:00 is after 2022-01-11T13:09:11Z
After some googling to solve this issue, I found a guide about rotating certificates:
https://learn.microsoft.com/en-us/azure/aks/certificate-rotation
Following the rotation guide fixed my certificate issue; however, all my pods were still in a Pending state, so I then followed this guide: https://learn.microsoft.com/en-us/azure/aks/update-credentials
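For context, the commands those two guides come down to are roughly the following (resource group, cluster name, and service principal values are placeholders):
# Rotate the cluster certificates (certificate-rotation guide)
az aks rotate-certs --resource-group myResourceGroup --name myAKSCluster

# Reset the cluster's service principal credentials (update-credentials guide)
az aks update-credentials \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --reset-service-principal \
  --service-principal <appId> \
  --client-secret <password>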
After that, one of my node pools (of type User) started working again, but the one of type System is still in a failed state with all pods Pending.
I am not sure what next steps I should take to solve this issue. Does anyone have any recommendations? I was going to delete the node pool and make a new one, but I can't do that either because it is the last system node pool.
Assuming you are using an API version older than 2020-03-01 for creating the AKS cluster:
A few limitations apply when you create and manage AKS clusters that support system node pools.
• An API version of 2020-03-01 or greater must be used to set a node pool mode. Clusters created on API versions older than 2020-03-01 contain only user node pools, but can be migrated to contain system node pools by following the update pool mode steps.
• The mode of a node pool is a required property and must be explicitly set when using ARM templates or direct API calls.
You can use the Bicep/JSON code provided in the MS document to create the AKS cluster, as it uses an upgraded API version.
You can also follow this MS document if you want to create a new AKS cluster with a system node pool, or add a dedicated system node pool to an existing AKS cluster.
The following command adds a dedicated node pool of mode type System with a count of three nodes.
az aks nodepool add \
--resource-group myResourceGroup \
--cluster-name myAKSCluster \
--name systempool \
--node-count 3 \
--node-taints CriticalAddonsOnly=true:NoSchedule \
--mode System
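If an existing cluster was created on an older API version and only has user pools, the "update pool mode steps" mentioned above come down to something like this sketch (resource group, cluster, and pool names are placeholders):
az aks nodepool update \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name nodepool1 \
  --mode System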
Foreword
When you create a Kubernetes cluster on AKS you specify the type of VMs you want to use for your nodes (--node-vm-size). I read that you can't change this after you create the Kubernetes cluster, which would mean that you'd be scaling horizontally instead of vertically whenever you add resources.
However, you can create different node pools in an AKS cluster that use different types of VMs for your nodes. So, I thought, if you want to "change" the type of VM that you chose initially, maybe add a new node pool and remove the old one ("nodepool1")?
I tried that through the following steps:
1. Create a node pool named "stda1v2" with a VM type of "Standard_A1_v2"
2. Delete "nodepool1" (az aks nodepool delete --cluster-name ... -g ... -n nodepool1)
Unfortunately, I was met with the error "Primary agentpool cannot be deleted".
Question
What is the purpose of the "primary agentpool", which cannot be deleted, and does it matter (a lot) what type of VM I choose when I create the AKS cluster (in a real-world scenario)?
Can I create other node pools and let the primary one live its life? Will it cause trouble in the future if I have node pools that use larger VMs for their nodes while the primary one is still using "Standard_A1_v2", for example?
The primary node pool is the first node pool in the cluster, and you cannot delete it because that is currently not supported. You can create and delete additional node pools and just leave the primary one as it is. It will not cause any trouble.
For the primary node pool I suggest picking a VM size that makes more sense in the long run (since you cannot change it). The B-series would be a good fit, since they are cheap and the CPU/memory ratio is good for average workloads.
P.S. You can always scale the primary node pool to 0 nodes, cordon the node and shut it down. You will have to repeat this after each upgrade, but otherwise it will work.
It looks like this functionality was introduced around the time of your question, allowing you to add new system nodepools and delete old ones, including the initial nodepool. After encountering the same error message myself while trying to tidy up a cluster, I discovered I had to set another nodepool to a system type in order to delete the first.
There's more info about it here, but in short, Azure node pools are split into two types ('modes', as they call them): System and User. When creating a single pool to begin with, it will be of the System type (favouring system pod scheduling), so it might be good to have a dedicated pool of a node or two for system use, then a second User node pool for the actual app pods.
So if you wish to delete your only system pool, you need to first create another nodepool with the --mode switch set to 'system' (with your preferred VM size etc.), then you'll be able to delete the first (and nodepool modes can't be changed after the fact, only on creation).
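A minimal sketch of that sequence (resource group, cluster, and pool names are placeholders):
# Add a replacement pool in System mode first
az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name system2 \
  --node-count 2 \
  --mode System

# Check which pools are System and which are User
az aks nodepool list \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --query "[].{name:name, mode:mode}" -o table

# Now the original pool can be deleted
az aks nodepool delete \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name nodepool1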
I'd like to move an instance of Azure Kubernetes Service to another subnet in the same virtual network. Is this possible, or is the only way to do it to recreate the AKS instance?
No, it is not possible; you need to redeploy AKS.
Edit (08.02.2023): it's actually possible to some extent now: https://learn.microsoft.com/en-us/azure/aks/configure-azure-cni-dynamic-ip-allocation#configure-networking-with-dynamic-allocation-of-ips-and-enhanced-subnet-support---azure-cli
I'm not sure it can be updated on an existing cluster without recreating it (or the node pool).
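If recreating just the node pool is acceptable, a rough sketch would be adding a new pool pointed at the new subnet (this assumes Azure CNI with dynamic IP allocation as in the linked doc; the subnet IDs and names are placeholders):
az aks nodepool add \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name newsubnet \
  --node-count 3 \
  --vnet-subnet-id <new-node-subnet-resource-id> \
  --pod-subnet-id <new-pod-subnet-resource-id>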
I know it's an old thread, but I'm responding in case someone finds it useful. You cannot change the subnet of the AKS cluster directly. However, you can always change the subnets of the underlying components. In our case, we had a simple setup of 2 nodes and a LoadBalancer. We created a new subnet and changed the subnets on these individual components. It worked for us, but do check the services and the pods to ensure everything keeps working correctly.
I have created 6 disks of 256 GB each on 2 Windows Server 2016 VMs. I need to implement an Active-Active SQL failover cluster on these 2 VMs using S2D.
I am getting an error while creating a storage pool for 3 disks; below is the error:
Cluster resource 'Cluster Pool 1' of type 'Storage Pool' in clustered role xxxxxx failed. The error code was '0x16' ('The device does not recognize the command.').
Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet
S2D is new in Windows Server 2016. You can check the prerequisites before you proceed with building your failover cluster. It is strongly recommended to validate the cluster first and then enable S2D, following "Configure the Windows Failover Cluster with S2D".
This error appeared because I tried to create the storage pool again. Basically, Enable-ClusterS2D had already created the pool for me; I didn't notice it and was trying to create the pool again using Failover Cluster Manager.
In order to achieve an active-active solution, you should configure a host/VM per location. In Azure, S2D does not work between two locations: it requires RDMA support for performance, which cannot be configured in Azure. So, to get HA for a SQL FCI, check StarWind vSAN Free, which can be configured between sites, replicating/mirroring storage. https://www.starwindsoftware.com/resource-library/installing-and-configuring-a-sql-server-failover-clustered-instance-on-microsoft-azure-virtual-machines
I see the following configuration: Storage Spaces provides disk redundancy (mirror or parity) within each VM, and StarWind distributes HA storage on top of the underlying Storage Spaces.
I would like to stop and deallocate the nodes in a Kubernetes cluster in Azure so it is not billed during weekends, for example. I can only set a minimum of 1 node using the az CLI.
Any ideas will be appreciated.
If you are using scale sets (used for autoscaling and multiple node pools), you can deallocate the scale sets via the UI (look in the resource group where the scale sets were created) or via the az CLI with az vmss deallocate.
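A minimal sketch of the CLI route (the node resource group and scale set names below are placeholders; the real ones live in the MC_* resource group of your cluster):
# Find the scale set backing the node pool
az vmss list --resource-group MC_myResourceGroup_myAKSCluster_eastus -o table

# Deallocate it, and start it again when needed
az vmss deallocate --resource-group MC_myResourceGroup_myAKSCluster_eastus --name aks-nodepool1-12345678-vmss
az vmss start --resource-group MC_myResourceGroup_myAKSCluster_eastus --name aks-nodepool1-12345678-vmss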
Usually, if we use PowerShell, an Automation runbook can work for us. Unfortunately, I could not find PowerShell cmdlets to start/stop AKS. The Azure CLI does support AKS start and stop, and it maintains the state of the AKS cluster so that you can resume where you left off.
One way is to use the Azure CLI and automate it on your own. Refer to this link for the Azure CLI start/stop commands: https://learn.microsoft.com/en-us/azure/aks/start-stop-cluster?tabs=azure-cli
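The commands from that doc are roughly the following (resource group and cluster names are placeholders):
# Stop the whole cluster (control plane and nodes)
az aks stop --resource-group myResourceGroup --name myAKSCluster

# Start it again later
az aks start --resource-group myResourceGroup --name myAKSCluster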
The other way is to use a ready-made solution. I found this marketplace solution that runs a VM and shuts down/starts the AKS cluster at the times specified. Refer to this link for the deployment:
https://azuremarketplace.microsoft.com/en-in/marketplace/apps/bowspritconsultingopcprivatelimited1596291408582.aksautomation2?tab=Overview
If you are good with automation and writing scripts, use option 1; otherwise go for option 2.
You can use the AKS start/stop option, or stop the VMSS manually by going to the infrastructure resource group, but the latter isn't supported and shouldn't be used.
https://learn.microsoft.com/en-us/azure/aks/start-stop-cluster?tabs=azure-cli
Also, for any user node pool, you can now set the count to 0.
https://learn.microsoft.com/en-us/azure/aks/scale-cluster
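For example, scaling a user node pool down to zero looks roughly like this (names are placeholders):
az aks nodepool scale \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name userpool \
  --node-count 0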