I'm trying to set up alerts to identify pods that exceed their memory/cpu requests and limits. I'm using Prometheus Helm chart v18.0.0 on a K8s v1.20 Amazon EKS cluster.
I also need to identify deployments that scale up/down often.
Is there a repo for such alerts so that I don't reinvent the wheel?
Related
I am running a service in AKS pods that establishes TLS connections with clients. There is a hard limit of 5K active connections per pod. I need a way to determine the number of active TLS connections per pod, scale out automatically (HPA) when it reaches a threshold (say 3.5K TLS connections), and scale in when active connections drop below 1K.
Is there a way to collect such metrics in AKS and scale based on them? Kindly suggest.
By default, scale-up operations performed manually or by the cluster autoscaler require the allocation and provisioning of new nodes, and scale-down operations delete nodes. Scale-down Mode allows you to decide whether you would like to delete or deallocate the nodes in your Azure Kubernetes Service (AKS) cluster upon scaling down.
There is no Microsoft documentation that covers autoscaling based on the number of TLS connections per pod.
Kubernetes has a cluster autoscaler that adjusts the number of nodes based on the requested compute resources in the node pool. By default, the cluster autoscaler checks the Metrics API server every 10 seconds for any required changes in node count. If the cluster autoscaler determines that a change is required, the number of nodes in your AKS cluster is increased or decreased accordingly. The cluster autoscaler works with Kubernetes RBAC-enabled AKS clusters that run Kubernetes 1.10.x or higher.
Cluster autoscaler is typically used alongside the horizontal pod autoscaler. When combined, the horizontal pod autoscaler increases or decreases the number of pods based on application demand, and the cluster autoscaler adjusts the number of nodes as needed to run those additional pods accordingly.
To get started with the cluster autoscaler in AKS, see Cluster Autoscaler on AKS.
Reference : https://learn.microsoft.com/en-us/azure/aks/concepts-scale#cluster-autoscaler
The TLS connection count for particular nodes can be obtained from platform metrics -> Microsoft.Blockchain/blockchainMembers -> ClusterCommEgressTlsConnectionCount
You can refer to the same here.
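To illustrate how the horizontal pod autoscaler would react to a per-pod connection count like the one in the question, here is a minimal sketch of the replica calculation described in the Kubernetes HPA documentation (desired = ceil(current * metric / target), with a default tolerance of 10%). The function name, the replica bounds, and the sample numbers are assumptions made for illustration; exposing the TLS connection count as a metric the HPA can read is a separate step that this sketch does not cover.

```python
import math


def desired_replicas(current_replicas, avg_connections_per_pod,
                     target_per_pod=3500, tolerance=0.1,
                     min_replicas=1, max_replicas=10):
    """Approximate the HPA replica calculation for an average-per-pod metric.

    target_per_pod, min_replicas and max_replicas are illustrative values,
    not settings taken from the question's cluster.
    """
    ratio = avg_connections_per_pod / target_per_pod
    # Within the tolerance band the HPA leaves the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


# 4 pods averaging 4200 connections each -> scale out
scale_out = desired_replicas(4, 4200)
# 4 pods averaging 900 connections each -> scale in
scale_in = desired_replicas(4, 900)
```

Note that the standard HPA works with a single target value per metric rather than separate scale-out and scale-in thresholds, so the 3.5K/1K pair from the question would have to be approximated by one target plus the scaling behavior settings.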
I want to find the Node scalability time on Azure Kubernetes Service (AKS) using Logs.
It's possible with some assumptions.
This information is taken from the Azure AKS documentation (consider getting familiar with it; it describes how to enable the logs, where to look for them, and so on):
To diagnose and debug autoscaler events, logs and status can be
retrieved from the autoscaler add-on.
AKS manages the cluster autoscaler on your behalf and runs it in the
managed control plane. You can enable control plane node to see the
logs and operations from CA (cluster autoscaler).
The same cluster-autoscaler is used across different platforms, and each of them can have some specific setup (e.g. for Azure AKS). Based on that, the logs should contain events like:
status, scaleUp, scaleDown, eventResult
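As a rough sketch of how a scale-up time could be derived once those events have been exported, the snippet below pairs each scaleUp event with the next eventResult and reports the elapsed seconds. The record shape (ISO timestamp plus event type), the function name, and the idea that eventResult marks the completion of the preceding scaleUp are all assumptions made for illustration; check the actual log schema in your cluster before relying on this pairing.

```python
from datetime import datetime


def scale_up_durations(events):
    """Return seconds elapsed between each scaleUp and the next eventResult.

    events: list of (iso_timestamp, event_type) tuples sorted by time.
    This record layout is hypothetical, not the real autoscaler log format.
    """
    durations = []
    started = None
    for timestamp, kind in events:
        moment = datetime.fromisoformat(timestamp)
        if kind == "scaleUp" and started is None:
            started = moment
        elif kind == "eventResult" and started is not None:
            durations.append((moment - started).total_seconds())
            started = None
    return durations


# Made-up sample records, only to show the calculation.
sample = [
    ("2023-01-01T10:00:00", "status"),
    ("2023-01-01T10:00:10", "scaleUp"),
    ("2023-01-01T10:03:40", "eventResult"),
]
durations = scale_up_durations(sample)
```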
I am trying to figure out what triggers an AKS cluster to scale out horizontally with nodes. I have a cluster that has been running at 103% CPU for 5+ minutes, but no action is taken. Any ideas what the triggers are and how I could customize them? If I start more jobs, the cluster will lower the CPU allocation for all pods.
The article that MS has doesn't say anything specific about that: https://learn.microsoft.com/en-us/azure/aks/cluster-autoscaler
Note that:
The cluster autoscaler is a Kubernetes component. Although the AKS
cluster uses a virtual machine scale set for the nodes, don't manually
enable or edit settings for scale set autoscale in the Azure portal or
using the Azure CLI. Let the Kubernetes cluster autoscaler manage the
required scale settings.
Which brings us to the actual Kubernetes Cluster Autoscaler:
Cluster Autoscaler is a tool that automatically adjusts the size of
the Kubernetes cluster when one of the following conditions is true:
there are pods that failed to run in the cluster due to insufficient resources.
there are nodes in the cluster that have been underutilized for an extended period of time and their pods can be placed on other existing
nodes.
The first condition above is the trigger you are looking for.
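A simplified sketch of that first condition: the autoscaler reacts to pods whose resource requests do not fit on any existing node, not to how busy the nodes are. The function name, the dictionary fields, and the numbers are hypothetical, and the real scheduler considers far more than CPU and memory (affinity, taints, and so on).

```python
def unschedulable_pods(pending_pods, nodes):
    """Return names of pending pods whose requests fit on no existing node.

    pending_pods: dicts with 'name', 'cpu_m' (millicores), 'mem_mi' (MiB).
    nodes: dicts with 'free_cpu_m' and 'free_mem_mi', i.e. allocatable
    capacity minus the requests of pods already placed there.
    """
    stuck = []
    for pod in pending_pods:
        fits_somewhere = any(
            node["free_cpu_m"] >= pod["cpu_m"]
            and node["free_mem_mi"] >= pod["mem_mi"]
            for node in nodes
        )
        if not fits_somewhere:
            stuck.append(pod["name"])
    return stuck


nodes = [{"free_cpu_m": 500, "free_mem_mi": 2048}]
# Fits by request size, so no scale-up even if the node's actual CPU usage is high.
small = [{"name": "job-a", "cpu_m": 250, "mem_mi": 512}]
# Requests more CPU than any node has free: this is what triggers a scale-up.
large = [{"name": "job-b", "cpu_m": 1000, "mem_mi": 512}]

needs_no_new_node = unschedulable_pods(small, nodes)
needs_new_node = unschedulable_pods(large, nodes)
```

This is why a cluster at 103% CPU may see no action: if the pods set low or no CPU requests, they keep being scheduled and simply share the available CPU.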
To get more details regarding the installation and configuration you can go through the Cluster Autoscaler on Azure. For example, you can customize your CA based on the Resources:
When scaling from an empty VM Scale Set (0 instances), Cluster
Autoscaler will evaluate the provided resources (cpu, memory,
ephemeral-storage) based on that VM Scale Set's backing instance type.
This can be overridden (for instance, to account for system reserved
resources) by specifying capacities with VMSS tags, formatted as:
k8s.io_cluster-autoscaler_node-template_resources_<resource name>: <resource value>. For instance:
k8s.io_cluster-autoscaler_node-template_resources_cpu: 3800m
k8s.io_cluster-autoscaler_node-template_resources_memory: 11Gi
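The tag naming scheme quoted above can be generated mechanically. The helper below is only an illustrative sketch (the function name is hypothetical); it builds the tag keys from resource names and leaves applying them to the VM Scale Set to whatever tooling you already use.

```python
# Prefix taken from the tag format quoted above.
TAG_PREFIX = "k8s.io_cluster-autoscaler_node-template_resources_"


def node_template_tags(resources):
    """Map resource names to VMSS tag keys in the quoted format.

    resources: dict such as {"cpu": "3800m", "memory": "11Gi"}.
    """
    return {TAG_PREFIX + name: value for name, value in resources.items()}


tags = node_template_tags({"cpu": "3800m", "memory": "11Gi"})
for key, value in tags.items():
    print(f"{key}: {value}")
```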
I am planning to deploy 15 different applications initially and would end up with 300+ applications on Azure Kubernetes, and would be using Prometheus and Grafana for monitoring.
I have deployed both Prometheus and Grafana in a separate namespace on a dedicated node.
How do I enable horizontal pod scaling for Prometheus and Grafana?
You can scale your applications based on custom metrics gathered by Prometheus and presented in the Grafana dashboard.
In order to do that you'll need the Prometheus Adapter to implement the custom metrics API, which enables the HorizontalPodAutoscaler controller to retrieve metrics using the custom.metrics.k8s.io API. You can define your own metrics through the adapter’s configuration so the HPA would scale based on those stats.
Here you can find a short guide that would get you started.
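As a sketch of what the HPA side of this setup can look like, the snippet below builds an autoscaling/v2 HorizontalPodAutoscaler manifest that targets a per-pod custom metric and prints it as JSON (which kubectl accepts as well as YAML). The metric name, deployment name, namespace, and numbers are hypothetical; the metric has to match whatever your Prometheus Adapter rules actually expose, and clusters older than Kubernetes 1.23 would need the autoscaling/v2beta2 API version instead.

```python
import json


def build_hpa(name, namespace, deployment, metric_name,
              average_value, min_replicas, max_replicas):
    """Build an HPA manifest that scales a Deployment on a Pods-type metric."""
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": deployment,
            },
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "metrics": [{
                "type": "Pods",
                "pods": {
                    # Served through custom.metrics.k8s.io by the adapter.
                    "metric": {"name": metric_name},
                    "target": {
                        "type": "AverageValue",
                        "averageValue": average_value,
                    },
                },
            }],
        },
    }


# All names and values below are placeholders for illustration.
hpa = build_hpa("my-app", "default", "my-app",
                "http_requests_per_second", "100", 2, 10)
manifest = json.dumps(hpa, indent=2)
print(manifest)
```

The printed manifest could be saved to a file and applied with kubectl apply -f once the adapter reports the metric.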
I have a node auto-scaling shell script which takes care of auto-scaling the worker nodes based on the average CPU/memory of all the nodes in the Kubernetes cluster.
I currently run this script from a bastion where I have the pks and kubectl CLIs installed, and I have also configured a cron job to run it every 5 minutes.
Is there any other way to do this in Kubernetes (PKS on AWS)?
Or maybe a way without a cron job, as the auto-scaling becomes completely dependent on cron?
Thanks
TL;DR: Autoscale with k8s
To set up autoscaling on k8s, use:
kubectl autoscale -f <controller>.yaml --min=3 --max=5
Note: PKS on AWS is overkill
You mentioned PKS
Using PKS on AWS infrastructure seems like overkill, simply because AWS has EKS.
To work with AWS cloud, VMware recommends VMC on AWS
PKS autoscale
If you do insist on using PKS on AWS, you may try this sample repo: pks-autoscale
The author of the repo also has a great PKS quickstart guide for AWS
Scaling on AWS
EKS autoscaling
AWS EKS supports three-dimensional scaling:
Cluster Autoscaler — The Kubernetes Cluster Autoscaler automatically adjusts the number of nodes in your cluster when pods fail to launch due to lack of resources or when nodes in the cluster are underutilized and their pods can be rescheduled on to other nodes in the cluster.
Horizontal Pod Autoscaler — The Kubernetes Horizontal Pod Autoscaler automatically scales the number of pods in a deployment, replication controller, or replica set based on that resource's CPU utilization.
Vertical Pod Autoscaler — The Kubernetes Vertical Pod Autoscaler automatically adjusts the CPU and memory reservations for your pods to help "right size" your applications. This can help you to better use your cluster resources and free up CPU and memory for other pods.
EC2 Auto Scaling
If you decided to build your own k8s cluster using PKS, you may use EC2 auto scaling - just create an Auto Scaling Group.
Using aws-cli:
aws autoscaling create-auto-scaling-group --auto-scaling-group-name <my-asg> --launch-configuration-name <my-launch-config> --min-size 3 --max-size 5 --vpc-zone-identifier "<zones>"
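If you would rather drive this from a script than from the CLI, the same request can be sketched with boto3. The helper below only assembles the keyword arguments for the create_auto_scaling_group call so it can be checked without AWS credentials; the helper name and the sample values are hypothetical, and the subnet IDs are placeholders you would replace with your own.

```python
def asg_request(group_name, launch_config, min_size, max_size, subnet_ids):
    """Assemble keyword arguments for boto3's create_auto_scaling_group."""
    if min_size > max_size:
        raise ValueError("min_size must not exceed max_size")
    return {
        "AutoScalingGroupName": group_name,
        "LaunchConfigurationName": launch_config,
        "MinSize": min_size,
        "MaxSize": max_size,
        # The API expects a comma-separated string of subnet IDs.
        "VPCZoneIdentifier": ",".join(subnet_ids),
    }


# Placeholder names and subnet IDs, for illustration only.
request = asg_request("my-asg", "my-launch-config", 3, 5,
                      ["subnet-aaaa", "subnet-bbbb"])

# With boto3 installed and credentials configured, the call would be:
#   import boto3
#   boto3.client("autoscaling").create_auto_scaling_group(**request)
```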
EC2 predictive scaling
Recently, AWS introduced predictive scaling for EC2:
... predictive scaling. Using data collected from your actual EC2 usage and further informed by billions of data points drawn from our own observations, we use well-trained Machine Learning models to predict your expected traffic (and EC2 usage) including daily and weekly patterns.
If you mean EKS on AWS, then there are different auto-scaling options.