Create a new pod when the old pod dies or crosses a threshold - linux

I am new to Kubernetes and I am experimenting with pods.
I have 3 pods running on 3 different nodes. One of the pods' apps is using 90%+ CPU and I want to create a health check for it.
Is there a way to create a health check in Kubernetes?
If I set an 80% CPU limit, will Kubernetes create a new pod or not?

You need a Horizontal Pod Autoscaler (HPA) to scale pods. There is a simple guide that walks you through creating one. Here's an example resource from that guide:
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: php-apache
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: php-apache
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50

As mentioned in the other answer, you should create a HorizontalPodAutoscaler for that Deployment object. The Kubernetes metrics server continuously watches CPU utilization for each pod, and once usage crosses the threshold, i.e. "averageUtilization: 50" in the example, a new pod is spawned as soon as the existing pod reaches 50% of the CPU provided to it.
This is different from a health check: the health of a pod decides whether traffic is sent to it or not, and it is checked via liveness and readiness probes.
Make sure you specify resource requests and limits for the pod in the Deployment you create, so that the HPA has a reference CPU value against which it can calculate the utilization percentage (see the sketch below).
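To make the last two points concrete, here is a minimal Deployment sketch; the name, image, port, and probe paths are placeholders rather than anything from the question. It shows CPU requests/limits for the HPA to reference, plus liveness and readiness probes for the actual health checking:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                        # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: my-app:latest          # placeholder image
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: 500m                 # HPA utilization % is computed against this request
            memory: 256Mi
          limits:
            cpu: "1"
            memory: 512Mi
        livenessProbe:                # restarts the container if it stops responding
          httpGet:
            path: /healthz            # placeholder endpoint
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 10
        readinessProbe:               # removes the pod from Service endpoints while failing
          httpGet:
            path: /ready              # placeholder endpoint
            port: 8080
          periodSeconds: 5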

Related

How to request a scale-up of a nodepool from 0 to X nodes in Azure Kubernetes using nodeSelector?

I have a kubernetes cluster (v1.24.3) running in Azure with 3 nodepools called small, standard and large. For each of these nodepools I have added a label named type, where the value is SMALL-2CPU-8GB, STANDARD-4CPU-16GB and LARGE-8CPU-32GB respectively. These nodepools are also configured with the autoscaler from Azure, and the min is 0 and the max is 10.
Now, I am deploying my applications which are required to run in each of these nodepools depending on the specification - for example, one of the apps requires a small node, so it is requesting to run in the nodepool called small with a label type=SMALL-2CPU-8GB and so on.
The way I am requesting this is by setting the nodeSelector in the manifest of the application. This is exactly the relevant portion of the template:
# App 1
podTemplate:
  spec:
    nodeSelector:
      type: LARGE-8CPU-32GB
      agentpool: large
# App 2
podTemplate:
  spec:
    nodeSelector:
      type: STANDARD-4CPU-16GB
      agentpool: standard
# App 3
podTemplate:
  spec:
    nodeSelector:
      type: SMALL-4CPU-16GB
      agentpool: small
...
When I apply the manifest to the cluster, the pods are in pending state with the message:
Normal NotTriggerScaleUp 43m (x13 over 45m) cluster-autoscaler pod didn't trigger scale-up: 3 node(s) didn't match Pod's node affinity/selector, 1 not ready for scale-up
And I can see that the node count is still 0, so the pod is not triggering the autoscaler to request a new node.
My question is, how to make the autoscaler work when I am requesting nodes (even when the nodepool has zero nodes) via the nodeSelector? Should I specify a different label or use taints?

AKS PersistentVolume Affinity?

Disclaimer: This question is very specific to the platforms we use and the use case we are trying to solve with them. It also compares two approaches we currently use, at least at a development stage, but perhaps don't fully understand yet. I am asking for guidance on this very specific topic...
A) We are running a Kafka cluster as Kafka Tasks on DC/OS, where data persistence is maintained via local disk storage provisioned on the very same host as the corresponding Kafka broker instance.
B) We are trying to run Kafka on Kubernetes (via Strimzi Operator), specifically Azure Kubernetes Service (AKS) and are struggling to get reliable Data Persistence using the StorageClasses you get in AKS. We tried three possibilities:
(Default) Azure Disk
Azure File
emptyDir
I see two major issues with Azure Disk: while we are able to set the Kafka pod affinity so that pods do not end up in the same maintenance zone / on the same host, we have no instrument to bind the corresponding PersistentVolume anywhere near the pod. There is nothing like nodeAffinity for Azure Disks. It is also fairly common for an Azure Disk to end up on a different host than its corresponding pod, which might then be limited by network bandwidth.
With Azure File we don't have issues with maintenance zones going down temporarily, but as a high-latency storage option it doesn't seem to be a good fit, and Kafka also has trouble deleting / updating files on retention.
So I ended up using ephemeral storage for the cluster, which is commonly NOT recommended but doesn't come with the problems above. The volume "lives" near the pod and is available to it as long as the pod itself runs on any node. In the maintenance case, pod AND volume die together. As long as I am able to maintain a quorum, I don't see where this might cause issues.
Is there anything like podAffinity for PersistentVolumes as Azure-Disk is per definition Node bound?
What are the major downsides in using emptyDir for persistence in a Kafka Cluster on Kubernetes?
Is there anything like podAffinity for PersistentVolumes as Azure-Disk
is per definition Node bound?
As far as I know, there is nothing like podAffinity for PersistentVolumes backed by Azure Disk. The Azure disk has to be attached to the node, so if the pod moves to a different host node, it can't use the volume on that disk. Only an Azure file share can follow the pod regardless of node.
What are the major downsides in using emptyDir for persistence in a
Kafka Cluster on Kubernetes?
You can take a look at the emptyDir documentation; one of its listed uses is:
scratch space, such as for a disk-based merge sort
Disk space is the main thing you need to watch out for when you use this on AKS. You need to calculate the disk space required; perhaps you need to attach multiple Azure disks to the nodes.
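To make that concrete, here is a minimal sketch of a pod spec using an emptyDir volume with a size limit; the names, image, mount path, and the 100Gi figure are placeholders, not from the question or answer:
apiVersion: v1
kind: Pod
metadata:
  name: kafka-broker-0             # placeholder name
spec:
  containers:
  - name: kafka
    image: kafka:latest            # placeholder image
    volumeMounts:
    - name: kafka-data
      mountPath: /var/lib/kafka/data
  volumes:
  - name: kafka-data
    emptyDir:
      sizeLimit: 100Gi             # the pod is evicted if the volume grows past this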
Starting off - I'm not sure what you mean about an Azure Disk ending up on a node other than where the pod is assigned - that shouldn't be possible, per my understanding (for completeness, you can do this on a VM with the shared disks feature outside of AKS, but as far as I'm aware that's not supported in AKS for dynamic disks at the time of writing). If you're looking at the volume.kubernetes.io/selected-node annotation on the PVC, I don't believe that's updated after initial creation.
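If you want to check that annotation on a PVC yourself, a jsonpath query along these lines should work (the PVC name is a placeholder; the escaping follows the same pattern as the node query further down):
❯ kubectl get pvc <pvc-name> -o jsonpath='{.metadata.annotations.volume\.kubernetes\.io\/selected-node}'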
You can reach the configuration you're looking for by using a StatefulSet with anti-affinity. Consider a StatefulSet like the one sketched after the node listing below. It creates three pods, which must be in different availability zones. I'm deploying it to an AKS cluster with a nodepool (nodepool2) that has two nodes per AZ:
❯ kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{","}{.metadata.labels.topology\.kubernetes\.io\/zone}{"\n"}{end}'
aks-nodepool1-25997496-vmss000000,0
aks-nodepool2-25997496-vmss000000,westus2-1
aks-nodepool2-25997496-vmss000001,westus2-2
aks-nodepool2-25997496-vmss000002,westus2-3
aks-nodepool2-25997496-vmss000003,westus2-1
aks-nodepool2-25997496-vmss000004,westus2-2
aks-nodepool2-25997496-vmss000005,westus2-3
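Since the exact manifest referred to above isn't reproduced here, the following is a minimal sketch of what such a StatefulSet could look like (the image is a placeholder; the names, namespace, storage class, and size are taken from the outputs below). It spreads pods across zones with anti-affinity and creates one PVC per pod via a volumeClaimTemplate:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: echo
  namespace: zonetest
spec:
  serviceName: echo
  replicas: 3
  selector:
    matchLabels:
      app: echo
  template:
    metadata:
      labels:
        app: echo
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: echo
            topologyKey: topology.kubernetes.io/zone   # at most one pod per availability zone
      containers:
      - name: echo
        image: k8s.gcr.io/echoserver:1.10              # placeholder image
        volumeMounts:
        - name: demo
          mountPath: /data
  volumeClaimTemplates:
  - metadata:
      name: demo
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: managed-premium
      resources:
        requests:
          storage: 1Gi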
Once the statefulset is deployed and spun up, you can see each pod was assigned to one of the nodepool2 nodes:
❯ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
echo-0 1/1 Running 0 3m42s 10.48.36.102 aks-nodepool2-25997496-vmss000001 <none> <none>
echo-1 1/1 Running 0 3m19s 10.48.36.135 aks-nodepool2-25997496-vmss000002 <none> <none>
echo-2 1/1 Running 0 2m55s 10.48.36.72 aks-nodepool2-25997496-vmss000000 <none> <none>
Each pod created a PVC based on the template:
❯ kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
demo-echo-0 Bound pvc-bf6104e0-c05e-43d4-9ec5-fae425998f9d 1Gi RWO managed-premium 25m
demo-echo-1 Bound pvc-9d9fbd5f-617a-4582-abc3-ca34b1b178e4 1Gi RWO managed-premium 25m
demo-echo-2 Bound pvc-d914a745-688f-493b-9b82-21598d4335ca 1Gi RWO managed-premium 24m
Let's take a look at one of the PVs that was created:
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/bound-by-controller: "yes"
    pv.kubernetes.io/provisioned-by: kubernetes.io/azure-disk
    volumehelper.VolumeDynamicallyCreatedByKey: azure-disk-dynamic-provisioner
  creationTimestamp: "2021-04-05T14:08:12Z"
  finalizers:
  - kubernetes.io/pv-protection
  labels:
    failure-domain.beta.kubernetes.io/region: westus2
    failure-domain.beta.kubernetes.io/zone: westus2-3
  name: pvc-9d9fbd5f-617a-4582-abc3-ca34b1b178e4
  resourceVersion: "19275047"
  uid: 945ad69a-92cc-4d8d-96f4-bdf0b80f9965
spec:
  accessModes:
  - ReadWriteOnce
  azureDisk:
    cachingMode: ReadOnly
    diskName: kubernetes-dynamic-pvc-9d9fbd5f-617a-4582-abc3-ca34b1b178e4
    diskURI: /subscriptions/02a062c5-366a-4984-9788-d9241055dda2/resourceGroups/rg-sandbox-aks-mc-sandbox0-westus2/providers/Microsoft.Compute/disks/kubernetes-dynamic-pvc-9d9fbd5f-617a-4582-abc3-ca34b1b178e4
    fsType: ""
    kind: Managed
    readOnly: false
  capacity:
    storage: 1Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: demo-echo-1
    namespace: zonetest
    resourceVersion: "19275017"
    uid: 9d9fbd5f-617a-4582-abc3-ca34b1b178e4
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: failure-domain.beta.kubernetes.io/region
          operator: In
          values:
          - westus2
        - key: failure-domain.beta.kubernetes.io/zone
          operator: In
          values:
          - westus2-3
  persistentVolumeReclaimPolicy: Delete
  storageClassName: managed-premium
  volumeMode: Filesystem
status:
  phase: Bound
As you can see, that PV has a required nodeAffinity for nodes whose failure-domain.beta.kubernetes.io/zone label has the value westus2-3. This ensures that the pod that owns that PV will only ever get placed on a node in westus2-3, and the underlying disk will be attached to whichever node the pod is started on.
At this point, I deleted all the pods to get them on the other nodes:
❯ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
echo-0 1/1 Running 0 4m4s 10.48.36.168 aks-nodepool2-25997496-vmss000004 <none> <none>
echo-1 1/1 Running 0 3m30s 10.48.36.202 aks-nodepool2-25997496-vmss000005 <none> <none>
echo-2 1/1 Running 0 2m56s 10.48.36.42 aks-nodepool2-25997496-vmss000003 <none> <none>
There's no way to see it via Kubernetes, but you can see via the Azure portal that managed disk kubernetes-dynamic-pvc-bf6104e0-c05e-43d4-9ec5-fae425998f9d, which backs PV pvc-bf6104e0-c05e-43d4-9ec5-fae425998f9d, which backs PVC zonetest/demo-echo-0, is listed as Managed by: aks-nodepool2-25997496-vmss_4, so it has been detached from the old node and attached to the node where the pod is now running.
Portal screenshot showing disk attached to node 4
If I were to remove nodes such that I didn't have nodes in AZ 3, I wouldn't be able to start pod echo-1, since it's bound to a disk in AZ 3, which can't be attached to a node not in AZ 3.

Azure Kubernetes - replica vs HPA?

What is the difference between replicas and HPA?
For example, the deployment below is configured with 3 replicas:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello
spec:
  replicas: 3
and the HPA below is configured with 2-20 replicas:
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: hello
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hello
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      targetAverageUtilization: 80
Does it mean that the above HPA will control the overall number of replicas irrespective of what is defined in deployment.yaml? When the HPA scales up, would it add one more deployment replica or three more?
Yes, based on the observations I have had with AKS.
The deployment.yaml asks for a desired number of replicas, and the HPA varies the count around this based on the configured metrics.
The Deployment object (when you do kubectl get deploy) always shows both the current and desired replicas, and you can see that number vary with load.
So it will start with 3 instances, then try to keep at least the minimum replicas available (which is why the minReplicas in the HPA and the replicas in the deployment file are usually kept the same), and then, based on the load computed against the provided metrics, it will scale up or down between the defined min and max.
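For illustration, the output looks roughly like this while the HPA has scaled the Deployment up (the numbers are made up for the example, not taken from the question):
❯ kubectl get deploy hello
NAME    READY   UP-TO-DATE   AVAILABLE   AGE
hello   5/5     5            5           12m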
It is important to add to the previous answer that the Deployment's spec.replicas and the HPA's spec.minReplicas can conflict. When both are configured, unexpected behaviour may arise.
If there is an HPA, it manages the number of replicas according to its settings. But while a Deployment is under the control of an HPA, applying a Deployment config with a fixed replica count overrides the current desired number of replicas and can scale your Deployment unexpectedly.
For example, if your Deployment has spec.replicas set to 1 and the HPA has currently scaled it to 5 replicas, applying the Deployment config sets the desired replicas back to 1 and immediately scales the Deployment down. Then the HPA takes back control, changes the desired replica count back to 5, and scales it up again.
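One common way to avoid this conflict (my suggestion, in line with the Kubernetes docs on migrating Deployments to autoscaling, not something from the original answer) is to omit spec.replicas from the Deployment manifest once the HPA owns scaling, so that re-applying the manifest never fights the autoscaler:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello
spec:
  # no "replicas" field: the HPA alone decides the replica count
  selector:
    matchLabels:
      app: hello
  template:
    metadata:
      labels:
        app: hello
    spec:
      containers:
      - name: hello
        image: hello:latest   # placeholder image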
Here is how this issue looks on my Grafana dashboard, which tracks the number of running replicas:
More on the topic:
Blog post about the problem.
Problem explained in Kubernetes GitHub issue.

Manage Docker containers at low scale

I have deployed 5 apps using Azure Container Instances. These are working fine; the issue is that currently all containers are running all the time, which gets expensive.
What I want to do is start/stop instances on demand, using a master container or VM that runs all the time for this purpose.
E.g.
The master service gets a request to spin up service number 3 for 2 hours and then shut it down, and all other containers stay off until they receive a similar request.
For my use case, each service will be used for less than 5 hours a day most of the time.
Now, I know Kubernetes is an engine made to manage containers, but all the examples I have found are for high-scale services, not for 5 services with only one container each; I'm also not sure whether Kubernetes allows keeping all the containers off most of the time.
What I was thinking of is handling all this through some API, but I'm not finding any service in Azure that allows something similar; I have only found options to create new containers, not to spin them up and shut them down.
EDIT:
Also, these apps run processes that are too heavy to host on a serverless platform.
The solution is to define a Horizontal Pod Autoscaler for your deployment.
The Horizontal Pod Autoscaler automatically scales the number of pods in a replication controller, deployment or replica set based on observed CPU utilization (or, with custom metrics support, on some other application-provided metrics). Note that Horizontal Pod Autoscaling does not apply to objects that can’t be scaled, for example, DaemonSets.
The Horizontal Pod Autoscaler is implemented as a Kubernetes API resource and a controller. The resource determines the behavior of the controller. The controller periodically adjusts the number of replicas in a replication controller or deployment to match the observed average CPU utilization to the target specified by user.
The configuration file should look like this:
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: hpa-images-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1beta1
    kind: Deployment
    name: example-deployment
  minReplicas: 2
  maxReplicas: 100
  targetCPUUtilizationPercentage: 75
scaleTargetRef should refer to your deployment definition, minReplicas can be set as low as you need (scaling all the way down to 0 requires the alpha HPAScaleToZero feature gate), and targetCPUUtilizationPercentage can be set according to your preferences. Such an approach should help you save money, because surplus pods are terminated once CPU utilization drops below the target.
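If you prefer not to maintain the manifest, an equivalent HPA can also be created imperatively (the deployment name here is the placeholder one from the example above):
kubectl autoscale deployment example-deployment --min=2 --max=100 --cpu-percent=75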
Kubernetes official documentation: kubernetes-hpa.
GKE autoscaler documentation: gke-autoscaler.
Useful blog about saving cash using GCP: kubernetes-google-cloud.

How to limit the number of pods with attached managed disks per node

Imagine there is a cluster with lots of different deployments running on it. Some pods use PersistentVolumes (Azure Disks). There is a limit in Azure on how many disks can be attached to a VM, and this leads to scheduling errors like:
Status=409 Code="OperationNotAllowed" Message="The maximum number of data disks allowed to be attached to a VM of this size is 8
Pods stay in the
Waiting: ContainerCreating
state forever, even though some nodes had far fewer pods with attached disks at the moment of scheduling. It would be great to limit the number of pods with attached disks per node so this error never happens. I believe
podAntiAffinity
is what I need, and I know I can keep pods with the same label from being scheduled on the same node, but I don't know how to allow it only until a node reaches the maximum number of pods with disks.
My installation is AKS.
az acs create \
--orchestrator-type=kubernetes \
--orchestrator-version 1.7.9 \
--resource-group <resource_group_here> \
--name=<name_here> \
...
KUBE_MAX_PD_VOLS is what you are looking for. By default its value is 16 for Azure Disks, so you can either use instance sizes that have the same attach limit (16) or set it to a preferable value. You can see where it's declared on GitHub.
You should set this environment variable in your scheduler declaration. I found my scheduler declaration in /etc/kubernetes/manifests/kube-scheduler.yaml. This is what it looks like now:
apiVersion: "v1"
kind: "Pod"
metadata:
name: "kube-scheduler"
...
spec:
containers:
- name: "kube-scheduler"
...
env:
- name: KUBE_MAX_PD_VOLS
value: "8"
...
Note the spec.containers.env KUBE_MAX_PD_VOLS setting - it prevents the scheduler from placing more than 8 disks on each node.
This way pods spread among nodes without any issues, and pods which cannot fit stay in Pending state until a node with capacity becomes available.
