Azure Kubernetes Service node pool upgrades and patches

I have some confusion about AKS node pool upgrades and patching. Could you please clarify the following?
I have one AKS node pool with 4 nodes, and I want to upgrade the Kubernetes version on only two nodes of the pool. Is that possible?
If it is possible to upgrade two nodes, how can we upgrade the remaining two nodes? And how can we find out which two nodes still run the old Kubernetes version instead of the latest one?
During the upgrade process, will it create two new nodes with the latest Kubernetes version and then delete the old nodes in the node pool?
Azure automatically applies patches to nodes, but will it create new nodes with the new patches and delete the old nodes?

1. According to the docs, you can upgrade a specific node pool.
So the approach to use is the one with an additional node pool, as mentioned by 4c74356b41.
Additional info:
Node upgrades
There is an additional process in AKS that lets you upgrade a cluster. An upgrade is typically to move to a newer version of Kubernetes, not just apply node security updates.
An AKS upgrade performs the following actions:
A new node is deployed with the latest security updates and Kubernetes version applied.
An old node is cordoned and drained.
Pods are scheduled on the new node.
The old node is deleted.
2. By default, AKS uses one additional node to configure upgrades.
You can control this process by increasing the --max-surge parameter:
To speed up the node image upgrade process, you can upgrade your node images using a customizable node surge value.
3. Security and kernel updates to Linux nodes:
In an AKS cluster, your Kubernetes nodes run as Azure virtual machines (VMs). These Linux-based VMs use an Ubuntu image, with the OS configured to automatically check for updates every night. If security or kernel updates are available, they are automatically downloaded and installed.
Some security updates, such as kernel updates, require a node reboot to finalize the process. A Linux node that requires a reboot creates a file named /var/run/reboot-required. This reboot process doesn't happen automatically.
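As a minimal sketch of how a script running on a node could detect the pending reboot described above, the check is just a test for the marker file. The path comes from the quoted documentation; the function name is only illustrative.

```python
from pathlib import Path

# Marker file a Linux node creates when a reboot is pending (path from the text above).
REBOOT_MARKER = Path("/var/run/reboot-required")


def reboot_required(marker: Path = REBOOT_MARKER) -> bool:
    """Return True if the reboot marker file exists on this node."""
    return marker.exists()


if __name__ == "__main__":
    print("reboot required" if reboot_required() else "no reboot pending")
```

The marker path is a parameter so the check can be pointed at a test location instead of the real file.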
This tutorial summarizes the process: Cluster Maintenance and Other Tasks

No. Create another pool with 2 nodes and test your application there, or create another cluster. You can find each node's version with kubectl get nodes.
It gradually updates nodes one by one (by default). You can change this. Spot instances cannot be upgraded.
Yes, the latest patch version image will be used.
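To see which nodes are still on the old version, one option is to group the output of kubectl get nodes -o json by kubelet version. The sketch below assumes that JSON shape (items[].status.nodeInfo.kubeletVersion); the node names and versions in the sample are made up for illustration.

```python
import json
from collections import defaultdict


def nodes_by_version(kubectl_json: str) -> dict:
    """Group node names by kubelet version from `kubectl get nodes -o json` output."""
    grouped = defaultdict(list)
    for node in json.loads(kubectl_json)["items"]:
        version = node["status"]["nodeInfo"]["kubeletVersion"]
        grouped[version].append(node["metadata"]["name"])
    return dict(grouped)


# Hypothetical sample, trimmed to the fields used above; real output has many more.
SAMPLE = json.dumps({
    "items": [
        {"metadata": {"name": "aks-pool1-0"},
         "status": {"nodeInfo": {"kubeletVersion": "v1.14.8"}}},
        {"metadata": {"name": "aks-pool1-1"},
         "status": {"nodeInfo": {"kubeletVersion": "v1.14.8"}}},
        {"metadata": {"name": "aks-pool1-2"},
         "status": {"nodeInfo": {"kubeletVersion": "v1.15.10"}}},
    ]
})

if __name__ == "__main__":
    for version, names in sorted(nodes_by_version(SAMPLE).items()):
        print(version, names)
```

In practice the string would come from the real command's output rather than from SAMPLE.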

Related

How to scale up Kubernetes cluster with Terraform avoiding downtime?

Here's the scenario: we have some applications running on a Kubernetes cluster on Azure. Currently our production cluster has one Nodepool with 3 nodes which are fairly low on resources because we still don't have that many active users/requests simultaneously.
Our backend API app is running on three pods, one on each node. I was told I will need to increase resources soon (I'm thinking more memory, or even replacing the VMs of the nodes with better ones).
We structured everything Kubernetes-related using Terraform, and I know that replacing VMs in a node is a destructive action, meaning the cluster will have to be replaced, and the new config and all deployments, services, etc. will have to be reapplied.
I am fairly new to the Kubernetes and Terraform world, meaning I can do the basics to get an application up and running but I would like to learn what is the best practice when it comes to scaling and performance. How can I perform such increase in resources without having any downtime of our services?
I'm wondering if having an extra node pool would help while I replace the VMs of the other one (I might be absolutely wrong here).
If there's any link, course, or tutorial you can point me to, it would be highly appreciated.
(Moved from comments)
In Azure, when you're performing a cluster upgrade, there's a parameter called "max surge count", which is equal to 1 by default. It means that when you update your cluster or node configuration, it will first create one extra node with the updated configuration, and only then will it safely drain and remove one of the old ones. More on this here: Azure - Node Surge Upgrade
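The surge behaviour described above can be illustrated with a toy model. This is not how AKS is implemented, only a simulation of the "add first, remove after" ordering; the function, node names and version strings are all hypothetical.

```python
def rolling_upgrade(nodes, new_version, max_surge=1):
    """Simulate a surge upgrade: add up to max_surge new nodes, then remove as many old ones."""
    pool = dict(nodes)      # node name -> version
    old = [name for name, version in pool.items() if version != new_version]
    history = []            # pool size after each add phase and each remove phase
    batch = 0
    while old:
        current, old = old[:max_surge], old[max_surge:]
        # Phase 1: surge nodes come up on the new version before anything is removed.
        for i in range(len(current)):
            pool[f"surge-{batch}-{i}"] = new_version
        history.append(len(pool))
        # Phase 2: the old nodes are drained and deleted (modelled here as plain removal).
        for name in current:
            del pool[name]
        history.append(len(pool))
        batch += 1
    return pool, history


if __name__ == "__main__":
    start = {"node-0": "1.14.8", "node-1": "1.14.8", "node-2": "1.14.8"}
    final, sizes = rolling_upgrade(start, "1.15.10", max_surge=1)
    print(sizes)
    print(final)
```

In this model the pool never drops below its original size, which is why workloads with several replicas can keep serving during the upgrade.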

How to patch GKE Managed Instance Groups (Node Pools) for package security updates?

I have a GKE cluster running multiple nodes across two zones. My goal is to have a job scheduled once a week that runs sudo apt-get upgrade to update the system packages. Doing some research, I found that GCP provides a tool called "OS patch management" that does exactly that. I tried to use it, but the Patch Job execution raised an error stating:
Failure reason: Instance is part of a Managed Instance Group.
I also noticed that during the creation of the GKE node pool, there is an option for enabling "Auto upgrade". But according to its description, it will only upgrade the Kubernetes version.
According to the blog post "Exploring container security: the shared responsibility model in GKE":
For GKE, at a high level, we are responsible for protecting:
The nodes’ operating system, such as Container-Optimized OS (COS) or Ubuntu. GKE promptly makes any patches to these images available. If you have auto-upgrade enabled, these are automatically deployed. This is the base layer of your container—it’s not the same as the operating system running in your containers.
Conversely, you are responsible for protecting:
The nodes that run your workloads. You are responsible for any extra software installed on the nodes, or configuration changes made to the default. You are also responsible for keeping your nodes updated. We provide hardened VM images and configurations by default, manage the containers that are necessary to run GKE, and provide patches for your OS—you’re just responsible for upgrading. If you use node auto-upgrade, it moves the responsibility of upgrading these nodes back to us.
The node auto-upgrade feature DOES patch the OS of your nodes; it does not just upgrade the Kubernetes version.
OS Patch Management only works for GCE VMs, not for GKE nodes.
You should refrain from doing OS-level upgrades in GKE, as that could cause unexpected behavior (for example, a package gets upgraded and changes something that breaks the GKE configuration).
You should let GKE auto-upgrade the OS and Kubernetes. Auto-upgrade will upgrade the OS because GKE releases are intertwined with the OS release.
One easy way to go is to enroll your clusters in release channels. This way they get upgraded as often as you want (depending on the channel), and your OS will be patched regularly.
You can also follow the GKE hardening guide, which provides steps to make sure your GKE clusters are as secure as possible.

Need to upgrade AKS version from 1.14.8 to 1.15.10. Not sure if the Nodes will reboot with this or not

Need to upgrade AKS version from 1.14.8 to 1.15.10. Not sure if the Nodes will reboot with this or not.
Could anyone please clarify this?
If you are using higher-level controllers such as a Deployment and running multiple replicas of the pod, you should not have downtime in your application: the scheduler normally spreads the replicas across different Kubernetes nodes, so when a particular node is cordoned/drained for upgrade or maintenance, other replicas of the pod are still running on other nodes.
If you run bare pods directly, you are going to have downtime in your application while the upgrade is happening.
From the documentation:
During the upgrade process, AKS adds a new node to the cluster that runs the specified Kubernetes version, then carefully cordons and drains one of the old nodes to minimize disruption to running applications. When the new node is confirmed as running application pods, the old node is deleted.
They will not be rebooted, only replaced with new ones.
By default, AKS upgrades nodes by first increasing the existing node capacity, so one extra node is spun up with the Kubernetes version you are upgrading to.
Then it upgrades the nodes one by one using a rolling strategy.
It moves all the pods to the new extra node and deletes the old node. This cycle continues until all nodes are updated to the latest version.
If we have a ReplicaSet or Deployment, there should ideally be no downtime.
We can also use podAntiAffinity so that no two pods are placed on the same node, which helps avoid downtime.
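As a rough illustration of the podAntiAffinity idea, the stanza below would go under spec.template.spec.affinity in a Deployment. It is built here as a Python dict and printed as JSON; the app label value "backend-api" is a made-up example, and the hard (required) rule is only one of the available variants.

```python
import json


def anti_affinity(app_label: str) -> dict:
    """Build an affinity stanza that keeps two pods with the same app label off one node."""
    return {
        "podAntiAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {"matchLabels": {"app": app_label}},
                    # Each node (identified by its hostname label) is its own placement domain.
                    "topologyKey": "kubernetes.io/hostname",
                }
            ]
        }
    }


if __name__ == "__main__":
    # "backend-api" is a hypothetical label, not taken from the question.
    print(json.dumps({"affinity": anti_affinity("backend-api")}, indent=2))
```

With a required rule, a replica stays pending if no node without a matching pod is available, so the softer preferredDuringSchedulingIgnoredDuringExecution form may suit small pools better.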

Cassandra 2+ HPC Deployment

I am trying to deploy Cassandra on a Linux-based HPC cluster and I need some guidelines, if possible. Specifically, what is the difference between running Cassandra locally and in a cluster?
When running locally (where it works smoothly), we duplicate the original files for every node inside our Cassandra directory and apply the appropriate changes for IP address, RPC, JMX, etc. However, when running across a network, which files do we need to install on each node: the whole package with all the files, or just some of the required ones, like bin/cassandra.in.sh, conf/cassandra.yaml, and bin/cassandra?
I am a little confused about what to store on each node separately in order to start working on the cluster.
You need to install Cassandra on each node (VM), i.e. the whole package, and then update the config files as necessary. As described here, to configure a cluster in a single data center you need to:
Install Cassandra on each node
Configure cluster name
Configure seeds
Configure snitch, if needed
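The per-node settings implied by the list above can be sketched as the cassandra.yaml keys that usually differ or must agree between nodes. The cluster name and IP addresses below are hypothetical, and the snitch default shown is an assumption; the dict is only a way to see the values side by side, not a replacement for editing conf/cassandra.yaml.

```python
import json


def node_settings(cluster_name, node_ip, seed_ips, snitch="SimpleSnitch"):
    """Return the cassandra.yaml values to set on one node of a single-data-center cluster."""
    return {
        "cluster_name": cluster_name,       # must be identical on every node
        "listen_address": node_ip,          # this node's own address
        "rpc_address": node_ip,
        "endpoint_snitch": snitch,
        "seed_provider": [{
            "class_name": "org.apache.cassandra.locator.SimpleSeedProvider",
            # Same comma-separated seed list on every node.
            "parameters": [{"seeds": ",".join(seed_ips)}],
        }],
    }


if __name__ == "__main__":
    seeds = ["10.0.0.1", "10.0.0.2"]                 # hypothetical addresses
    for ip in ["10.0.0.1", "10.0.0.2", "10.0.0.3"]:
        print(json.dumps(node_settings("hpc-cluster", ip, seeds), indent=2))
```

Only listen_address and rpc_address change from node to node here; the cluster name and seed list are shared.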

migrating cassandra from 1.1.2 to 1.2.6

My current Cassandra version is 1.1.2, running as a single-node cluster. I would like to upgrade it to 1.2.6 with multiple nodes in the ring. Is it proper to migrate directly to 1.2.6, or should I migrate version by version?
I found the upgrade steps at this link:
http://fossies.org/linux/misc/apache-cassandra-1.2.6-bin.tar.gz:a/apache-cassandra-1.2.6/NEWS.txt.
There are 9 other releases available between these two versions.
I migrated a two-node cluster from 1.1.6 to 1.2.6 without problems and without going version by version. Anyway, you should take a closer look at:
http://www.datastax.com/documentation/cassandra/1.2/index.html?pagename=docs&version=1.2&file=index#upgrade/upgradeC_c.html#concept_ds_smb_nyr_ck
Because version 1.2 introduces a lot of new features, such as the partitioners, you may need to change some configuration for your cluster.
You can hop directly to C1.2.6.
We recently migrated our 4-node cluster from C1.0.9 to C1.2.8 without any issues. This was a rolling upgrade, i.e. upgrade one node at a time and, after each node is upgraded, allow the cluster to stabilize (how long depends on the traffic during the upgrade).
These are the steps that we followed:
Perform the following on each node:
Run nodetool disablegossip and disablethrift so that this node is seen as DOWN by other nodes.
Flush/drain the memtables and run compaction to merge SSTables.
Take a snapshot and enable incremental backups.
This stops all the other nodes/clients from writing to this node, and since memtables are flushed to disk, startup is fast because it does not need to walk through the commit logs.
Stop Cassandra (though this node is down, the cluster stays available for writes/reads, so zero downtime).
Upgrade SSTables to the new storage format using sstableupgrade.
Install/untar Cassandra 1.2.8 in the new location.
Move the upgraded SSTables to the appropriate location.
Merge cassandra.yaml from the previous version and the current version by a manual diff (need to detail out the differences).
Start Cassandra.
Watch the startup messages to ensure the node comes up without difficulty and is shown in the ring with mixed 1.0.x/1.2.x.
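The one-node-at-a-time ordering of the steps above can be sketched as a plan generator. It only builds a list of actions and runs nothing. The nodetool subcommands are the ones named in the steps, but placing drain last among them is my own assumption, the hosts are hypothetical, and the stop/install/merge/start steps are left as a manual placeholder.

```python
# nodetool subcommands run before stopping a node; ordering is an assumption (drain last).
NODETOOL_STEPS = ["disablegossip", "disablethrift", "flush", "compact", "snapshot", "drain"]

MANUAL_STEP = "manual: stop, run sstableupgrade, install new version, merge cassandra.yaml, start"


def node_plan(host):
    """Return the ordered actions for upgrading a single node."""
    actions = [f"nodetool -h {host} {step}" for step in NODETOOL_STEPS]
    actions.append(MANUAL_STEP)
    return actions


def rolling_plan(hosts):
    """Rolling upgrade: every action for one host precedes any action for the next host."""
    plan = []
    for host in hosts:
        plan.extend((host, action) for action in node_plan(host))
    return plan


if __name__ == "__main__":
    for host, action in rolling_plan(["10.0.0.1", "10.0.0.2"]):  # hypothetical hosts
        print(host, "->", action)
```

Between hosts, an operator would wait for the cluster to stabilize before starting the next node, as the answer describes.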

Resources