How to patch GKE Managed Instance Groups (Node Pools) for package security updates?

I have a GKE cluster running multiple nodes across two zones. My goal is to have a scheduled job run once a week that executes sudo apt-get upgrade to update the system packages. Doing some research, I found that GCP provides a tool called "OS patch management" that does exactly that. I tried to use it, but the patch job execution failed with the error:
Failure reason: Instance is part of a Managed Instance Group.
I also noticed that during the creation of the GKE node pool there is an option for enabling "Auto upgrade". But according to its description, it will only upgrade the Kubernetes version.

According to the Blog Exploring container security: the shared responsibility model in GKE:
For GKE, at a high level, we are responsible for protecting:
The nodes’ operating system, such as Container-Optimized OS (COS) or Ubuntu. GKE promptly makes any patches to these images available. If you have auto-upgrade enabled, these are automatically deployed. This is the base layer of your container—it’s not the same as the operating system running in your containers.
Conversely, you are responsible for protecting:
The nodes that run your workloads. You are responsible for any extra software installed on the nodes, or configuration changes made to the default. You are also responsible for keeping your nodes updated. We provide hardened VM images and configurations by default, manage the containers that are necessary to run GKE, and provide patches for your OS—you’re just responsible for upgrading. If you use node auto-upgrade, it moves the responsibility of upgrading these nodes back to us.
The node auto-upgrade feature DOES patch the OS of your nodes; it does not just upgrade the Kubernetes version.

OS Patch Management only works for GCE VMs, not for GKE nodes.
You should refrain from doing OS-level upgrades in GKE yourself; that could cause unexpected behavior (for example, a package gets upgraded and changes something that breaks the GKE configuration).
You should let GKE auto-upgrade both the OS and Kubernetes. Auto-upgrade will upgrade the OS, as GKE releases are intertwined with the OS releases.
One easy way to go is to sign your clusters up for release channels; this way they get upgraded as often as you want (depending on the channel) and your OS will be patched regularly.
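For example, enrolling an existing cluster in a release channel and making sure node auto-upgrade is enabled could look roughly like this with gcloud (a sketch only; the cluster, pool, and zone names are placeholders):
gcloud container clusters update my-cluster \
  --zone us-central1-a \
  --release-channel regular
# Auto-upgrade is normally on by default, but it can also be enabled per node pool:
gcloud container node-pools update default-pool \
  --cluster my-cluster \
  --zone us-central1-a \
  --enable-autoupgrade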
You can also follow the GKE hardening guide, which provides steps to make sure your GKE clusters are as secure as possible.

Related

How to scale up Kubernetes cluster with Terraform avoiding downtime?

Here's the scenario: we have some applications running on a Kubernetes cluster on Azure. Currently our production cluster has one node pool with 3 nodes, which are fairly low on resources because we still don't have that many active users/requests simultaneously.
Our backend API app is running on three pods, one on each node. I was told I will need to increase resources soon (I'm thinking more memory, or even replacing the node VMs with better ones).
We structured everything Kubernetes-related using Terraform, and I know that replacing the VMs of a node pool is a destructive action, meaning the cluster will have to be replaced, with a new config, and all deployments, services, etc. will have to be reapplied.
I am fairly new to the Kubernetes and Terraform world, meaning I can do the basics to get an application up and running, but I would like to learn the best practices when it comes to scaling and performance. How can I perform such an increase in resources without any downtime of our services?
I'm wondering if having an extra node pool would help while I replace the VMs of the other one (I might be absolutely wrong here).
If there's any link, course, or tutorial you can point me to, it would be highly appreciated.
(Moved from comments)
In Azure, when you're performing a cluster upgrade, there's a parameter called "max surge count", which is 1 by default. This means that when you update your cluster or node configuration, Azure will first create one extra node with the updated configuration, and only then safely drain and remove one of the old ones. More on this here: Azure - Node Surge Upgrade.
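For example, the surge value can be raised per node pool with the az CLI (a sketch; the resource group, cluster, and pool names are placeholders):
az aks nodepool update \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name nodepool1 \
  --max-surge 33%
A higher surge value upgrades more nodes in parallel, at the cost of extra temporary capacity during the upgrade.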

Azure kubernetes service node pool upgrades & patches

I have some confusion about AKS node pool upgrades and patching. Could you please clarify the following?
I have one AKS node pool with 4 nodes, and I want to upgrade the Kubernetes version on only two nodes of the pool. Is that possible?
If it is possible to upgrade only two nodes, how can we upgrade the remaining two? And how can we find out which two nodes are running the old Kubernetes version instead of the latest one?
During the upgrade process, will it create two new nodes with the latest Kubernetes version and then delete the old nodes in the node pool?
Azure automatically applies patches to nodes, but will it create new nodes with the new patches and delete the old ones?
1. According to the docs:
you can upgrade a specific node pool.
So the approach with an additional node pool mentioned by 4c74356b41 will work; a sketch of upgrading a single pool follows the list below.
Additional info:
Node upgrades
There is an additional process in AKS that lets you upgrade a cluster. An upgrade is typically to move to a newer version of Kubernetes, not just apply node security updates.
An AKS upgrade performs the following actions:
A new node is deployed with the latest security updates and Kubernetes version applied.
An old node is cordoned and drained.
Pods are scheduled on the new node.
The old node is deleted.
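As mentioned in point 1, upgrading a single node pool rather than the whole cluster might look like this (a sketch; the names and target version are placeholders):
az aks nodepool upgrade \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name nodepool1 \
  --kubernetes-version 1.19.7
Note that a node pool cannot run a newer Kubernetes version than the control plane, so the control plane may need to be upgraded first.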
2. By default, AKS uses one additional node to configure upgrades.
You can control this process by increasing the --max-surge parameter.
To speed up the node image upgrade process, you can upgrade your node images using a customizable node surge value.
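A node-image-only upgrade with a custom surge value could be triggered like this (a sketch; the resource names are placeholders):
az aks nodepool upgrade \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --name nodepool1 \
  --node-image-only \
  --max-surge 33%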
3. Security and kernel updates to Linux nodes:
In an AKS cluster, your Kubernetes nodes run as Azure virtual machines (VMs). These Linux-based VMs use an Ubuntu image, with the OS configured to automatically check for updates every night. If security or kernel updates are available, they are automatically downloaded and installed.
Some security updates, such as kernel updates, require a node reboot to finalize the process. A Linux node that requires a reboot creates a file named /var/run/reboot-required. This reboot process doesn't happen automatically.
This tutorial summarizes the process of Cluster Maintenance and Other Tasks.
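To check whether a particular node is waiting for such a reboot, one option is a node debug container (a sketch; it assumes kubectl debug for nodes is available in your cluster version, and the node name is a placeholder):
# the node's filesystem is mounted at /host inside the debug container
kubectl debug node/aks-nodepool1-12345678-vmss000000 -it --image=busybox \
  -- ls /host/var/run/reboot-required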
No, create another pool with 2 nodes and test your application there, or create another cluster. You can find the node versions with kubectl get nodes (example below).
It gradually updates nodes one by one (by default); you can change this behavior. Spot instances cannot be upgraded.
Yes, the latest patch version image will be used.
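To see which nodes are still on the old version (as mentioned above), the kubelet version reported by kubectl get nodes is enough:
kubectl get nodes -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion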

Need to upgrade AKS version from 1.14.8 to 1.15.10. Not sure if the Nodes will reboot with this or not

I need to upgrade the AKS version from 1.14.8 to 1.15.10 and am not sure whether the nodes will reboot with this or not.
Could anyone please clarify this for me?
If you are using higher-level controllers such as a Deployment and running multiple replicas of the pod, you are not going to have downtime in your application, because Kubernetes will distribute the pod replicas across different nodes; when a particular node is cordoned/drained for upgrade or maintenance, you still have other replicas of the pod running on other nodes.
If you use bare pods directly, you are going to have downtime in your application while the upgrade is happening.
Reading the documentation, we can find:
During the upgrade process, AKS adds a new node to the cluster that runs the specified Kubernetes version, then carefully cordons and drains one of the old nodes to minimize disruption to running applications. When the new node is confirmed as running application pods, the old node is deleted.
They will not be rebooted, only replaced with new ones.
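For reference, the upgrade in question would be triggered roughly like this (a sketch; the resource group and cluster names are placeholders):
az aks upgrade \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --kubernetes-version 1.15.10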
When we upgrade, by default AKS will upgrade the nodes by increasing the existing node capacity, so one extra node will be spun up with the Kubernetes version you are upgrading to.
Then, using a rolling strategy, it will upgrade the nodes one by one.
It will move all the pods to the new extra node and delete the old node. This cycle continues until all nodes are updated to the latest version.
If we have a ReplicaSet or Deployment, there should ideally be no downtime.
We can also use podAntiAffinity so that no two replicas end up on the same node, which further reduces the risk of downtime (see the sketch below).
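A minimal sketch of such a Deployment (the app name, labels, and image are hypothetical):
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: backend-api
  template:
    metadata:
      labels:
        app: backend-api
    spec:
      affinity:
        podAntiAffinity:
          # keep replicas on different nodes, so draining one node
          # never takes down more than one replica
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: backend-api
            topologyKey: kubernetes.io/hostname
      containers:
      - name: api
        image: registry.example.com/backend-api:1.0   # placeholder image
EOF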

How to Scale out Gitlab EE

Currently I am running the whole GitLab EE as a single container. I need to scale out the service so that it can support more users and more operations (pulls, pushes, merge requests, etc.) simultaneously.
I need to run a separate Redis cluster
I need to run a separate PostgreSQL cluster
I need to integrate Elasticsearch for search
But how can I scale out the remaining core GitLab services? Do they support a scale-out architecture?
gitlab workhorse
unicorn (gitlab rails)
sidekiq (gitlab rails)
gitaly
gitlab shell
Do they support a scale-out architecture?
Not exactly, considering the GitLab Omnibus image is one package with bundled dependencies.
But I never experienced so much traffic that it needed to be split up and scaled out.
There is though a proposal for splitting up the Omnibus image: gitlab-org/omnibus-gitlab issue 1800.
It points to gitlab-org/build/CNG, which does just what you are looking for:
Each directory contains the Dockerfile for a specific component of the infrastructure needed to run GitLab.
rails - The Rails code needed for both API and web.
unicorn - The Unicorn container that exposes Rails.
sidekiq - The Sidekiq container that runs async Rails jobs
shell - Running GitLab Shell and OpenSSH to provide git over ssh, and authorized keys support from the database
gitaly - The Gitaly container that provides distributed git repos
The other option, using Kubernetes, is the charts/gitlab:
The gitlab chart is the best way to operate GitLab on Kubernetes. This chart contains all the required components to get started, and can scale to large deployments.
Some of the key benefits of this chart and corresponding containers are:
Improved scalability and reliability
No requirement for root privileges
Utilization of object storage instead of NFS for storage
The default deployment includes:
Core GitLab components: Unicorn, Shell, Workhorse, Registry, Sidekiq, and Gitaly
Optional dependencies: Postgres, Redis, Minio
An auto-scaling, unprivileged GitLab Runner using the Kubernetes executor
Automatically provisioned SSL via Let's Encrypt.
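A minimal install of that chart looks roughly like this (a sketch following the chart's quick-start pattern; the domain and email are placeholders):
helm repo add gitlab https://charts.gitlab.io/
helm repo update
helm install gitlab gitlab/gitlab \
  --set global.hosts.domain=example.com \
  --set certmanager-issuer.email=admin@example.com
Individual components (Unicorn, Sidekiq, Gitaly, and so on) can then be scaled through the chart's values.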
Update Sept. 2020:
GitLab 13.4 offers one feature which can help scaling out GitLab on-premise:
Gitaly Cluster majority-wins reference transactions (beta)
Gitaly Cluster allows Git repositories to be replicated on multiple warm Gitaly nodes. This improves fault tolerance by removing single points of failure.
Reference transactions, introduced in GitLab 13.3, cause changes to be broadcast to all the Gitaly nodes in the cluster, but only the Gitaly nodes that vote in agreement with the primary node persist the changes to disk.
If all the replica nodes dissented, only one copy of the change would be persisted to disk, creating a single point of failure until asynchronous replication completed.
Majority-wins voting improves fault tolerance by requiring a majority of nodes to agree before persisting changes to disk. When the feature flag is enabled, writes must succeed on multiple nodes. Dissenting nodes are automatically brought in sync by asynchronous replication from the nodes that formed the quorum.
See Documentation and Issue.

Drone slaves provided by CoreOs

I have a drone host and a CoreOS cluster with fleet.
The Drone host currently has only unix:///var/run/docker.sock in the nodes menu.
As I understand it, I could add other Docker nodes defined by Docker URLs and certificates. However, since I have a CoreOS cluster, it seems logical to use that as the provider of the slaves. I am looking for a solution where:
(1) I don't have to reconfigure the nodes whenever the CoreOS cluster configuration changes, and
(2) resource management is handled correctly.
I could think of the following solutions:
Expose Docker URIs on the CoreOS cluster nodes and configure all of them directly in Drone. In this case I would have to follow CoreOS cluster changes manually, and resource management would probably conflict with that of fleet.
Expose Docker URIs on the CoreOS cluster nodes and provide DNS round-robin based access. This seems like a terrible way to manage resources and would most probably conflict with fleet.
Install Swarm on the CoreOS nodes. Resource management would probably conflict with that of fleet.
Have fleet or rkt expose a Docker URI, with fleet/rkt deciding which node the container runs on. The problem is that I could not find any way to do this.
Have drone.io use fleet or rkt directly. Same problem. Is that possible?
Is there any way to satisfy all of my requirements with drone.io and CoreOS?
As I understand it, I could add other Docker nodes defined by Docker URLs and certificates. However, since I have a CoreOS cluster, it seems logical to use that as the provider of the slaves.
The newest version of Drone supports build agents. Build agents are installed per server and communicate with the central Drone server to pull builds from the queue, execute them, and send back the results.
docker run \
-e DRONE_SERVER=http://my.drone.server \
-e DRONE_SECRET=passcode \
-v /var/run/docker.sock:/var/run/docker.sock \
drone/drone:0.5 agent
This allows you to add and remove agents on the fly without having to register or manage them at the server level.
I believe this should solve the basic problem you've outlined, although I'm not sure it will provide the level of integration you desire with fleet and CoreOS. Perhaps a CoreOS expert can augment my answer.
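If you do want fleet itself to place an agent on every CoreOS machine, a global fleet unit is one possible sketch (the unit name, server URL, and secret are placeholders):
# drone-agent.service
[Unit]
Description=Drone build agent
After=docker.service
Requires=docker.service

[Service]
ExecStartPre=-/usr/bin/docker rm -f drone-agent
ExecStart=/usr/bin/docker run --rm --name drone-agent \
  -e DRONE_SERVER=http://my.drone.server \
  -e DRONE_SECRET=passcode \
  -v /var/run/docker.sock:/var/run/docker.sock \
  drone/drone:0.5 agent
ExecStop=/usr/bin/docker stop drone-agent

[X-Fleet]
Global=true

Submitting it with fleetctl start drone-agent.service schedules one agent per machine, so agents come and go with the cluster membership instead of being registered manually.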
