The Google Cloud Load Balancer monitoring dashboard shows the number of healthy nodes.
Is there any metric / MQL that can be used to create an alert when a node is considered down?
There is no direct metric for the Load Balancer's backends, but you can create a Monitoring group containing them and then set up an Uptime check for monitoring and alerting. You can follow these steps to accomplish that:
Tag all the backend instances:
gcloud compute instances add-tags <instance_name> --tags=lb-backend --zone <instance_zone>
Create the monitoring group using Resource Type = gce_instance AND the tag as discriminators.
Create the Uptime check and alerting policy for that group.
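For the alerting policy itself, the uptime check writes the check_passed metric, which is what you alert on. As a rough sketch (group.id restricts the time series to members of the Monitoring group; <GROUP_ID> is a placeholder for the group's ID, and the aggregation is easiest to configure in the console), the condition's metric filter looks something like:
metric.type="monitoring.googleapis.com/uptime_check/check_passed"
AND resource.type="gce_instance"
AND group.id="<GROUP_ID>"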
On a k8s cluster (v1.23.12) running in Azure (AKS), I have deployed the helm chart azure-resourcemanager-exporter-1.0.4 from https://artifacthub.io/packages/container/azure-resourcemanager-exporter/azure-resourcemanager-exporter
Metrics are scraped from the k8s cluster using a local instance of Prometheus (screenshot: Prometheus version) and forwarded to "Azure Monitor Managed Service for Prometheus" using remote write.
When executing the following PromQL query on the Graph tab of the local Prometheus instance, I get the expected result:
sum by(dimensionValue) (azurerm_costmanagement_detail_actualcost{timeframe="MonthToDate", dimensionValue=~"microsoft.*"})
This fetches all series matching the metric name and label filters and calculates the sum over dimensions while preserving the label "dimensionValue".
(screenshot: result of the query in Prometheus)
When I execute the same query in the Prometheus explorer blade of my Azure Monitor workspace instance, the query returns the sum of the metric as if the sum over the label "dimensionValue" was not there.
(screenshot: query in Prometheus explorer)
The label "dimensionValue" do exist in the labels of the metrics in the Azure Monitor workspace
(screenshot: metric label exists)
I also tried to scrape the metrics from the exporter using the Azure Monitor agent in the k8s cluster, following the instructions in this article https://learn.microsoft.com/en-us/azure/azure-monitor/essentials/prometheus-metrics-scrape-validate#create-prometheus-configuration-file (not using remote write).
I get the same results when I execute the same query in Prometheus explorer.
I have created an ADF pipeline with a Notebook activity. This notebook activity automatically creates Databricks job clusters with autogenerated job cluster names.
1. Rename Job Cluster during runtime from ADF
I'm trying to rename this job cluster to a process-specific (or other) name at runtime from ADF / the ADF linked service.
Instead of job-59, I want it to be replaced with <process_name>_
2. Rename ClusterName Tag
I want to replace the default generated ClusterName tag with the required process name.
Settings for the job can be updated using the Reset or Update endpoints.
Cluster tags allow you to easily monitor the cost of cloud resources used by various groups in your organization. You can specify tags as key-value pairs when you create a cluster, and Azure Databricks applies these tags to cloud resources like VMs and disk volumes, as well as DBU usage reports.
For detailed information about how pool and cluster tag types work together, see Monitor usage using cluster, pool, and workspace tags.
For convenience, Azure Databricks applies four default tags to each cluster: Vendor, Creator, ClusterName, and ClusterId.
These tags propagate to detailed cost analysis reports that you can access in the Azure portal.
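Putting the two together, here is a rough sketch of using the Update endpoint to add a process-specific custom tag to the job cluster (the workspace URL, token, job ID, cluster key and tag values below are all placeholders, and the cluster spec should be adjusted to match your actual job definition):
# Hypothetical example: Jobs API 2.1 Update call that adds a custom tag to a job cluster.
# <databricks-instance>, <personal-access-token>, the job_id and job_cluster_key are placeholders.
curl -X POST https://<databricks-instance>/api/2.1/jobs/update \
  -H "Authorization: Bearer <personal-access-token>" \
  -d '{
    "job_id": 123,
    "new_settings": {
      "job_clusters": [
        {
          "job_cluster_key": "my_job_cluster",
          "new_cluster": {
            "spark_version": "11.3.x-scala2.12",
            "node_type_id": "Standard_DS3_v2",
            "num_workers": 2,
            "custom_tags": { "process": "<process_name>" }
          }
        }
      ]
    }
  }'
Note that ClusterName is one of the default tags applied by Azure Databricks itself, so the usual approach is to add your own custom tag rather than overriding it.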
Check out an example of how billing works.
I have deployed 5 apps using Azure Container Instances. These are working fine; the issue I have is that currently all containers are running all the time, which gets expensive.
What I want to do is start/stop instances when required, using a master container or VM that will be running all the time.
E.g. the master service gets a request to spin up service number 3 for 2 hours, then shut it down, and all other containers stay off until they receive a similar request.
For my use case, each service will be used for less than 5 hours a day most of the time.
Now, I know Kubernetes is an engine made to manage containers, but all the examples I have found are for high-scale services, not for 5 services with only one container each; I'm also not sure whether Kubernetes allows keeping all the containers off most of the time.
What I was thinking of is handling all of this through some API, but I'm not finding any service in Azure that allows something like this; I have only found options to create new containers, not to start and stop them.
EDIT:
Also, these apps run processes that are too heavy to host on a serverless platform.
The solution is to define a Horizontal Pod Autoscaler for your deployment.
The Horizontal Pod Autoscaler automatically scales the number of pods in a replication controller, deployment or replica set based on observed CPU utilization (or, with custom metrics support, on some other application-provided metrics). Note that Horizontal Pod Autoscaling does not apply to objects that can’t be scaled, for example, DaemonSets.
The Horizontal Pod Autoscaler is implemented as a Kubernetes API resource and a controller. The resource determines the behavior of the controller. The controller periodically adjusts the number of replicas in a replication controller or deployment to match the observed average CPU utilization to the target specified by user.
The configuration file should look like this:
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: hpa-images-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-deployment
  minReplicas: 2
  maxReplicas: 100
  targetCPUUtilizationPercentage: 75
scaleTargetRef should refer to your deployment definition; minReplicas and targetCPUUtilizationPercentage you can set according to your preferences (note that a standard HPA requires minReplicas of at least 1). This approach should help you save money, because pods are removed when CPU utilization drops below the target.
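A quick usage sketch, assuming the manifest above is saved as hpa.yaml (the file name is just an example):
# apply the HPA and check its current/target utilization and replica counts
kubectl apply -f hpa.yaml
kubectl get hpa hpa-images-service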
Kubernetes official documentation: kubernetes-hpa.
GKE autoscaler documentation: gke-autoscaler.
Useful blog about saving cash using GCP: kubernetes-google-cloud.
Spark needs lots of resources to do its job. Kubernetes is a great environment for resource management. How many Spark pods do you run per node to get the best resource utilization?
I'm trying to run a Spark cluster on a Kubernetes cluster.
It depends on many factors. We need to know how many resources you have and how much is being consumed by the pods. To find that out, you need to set up a Metrics Server.
Metrics Server is a cluster-wide aggregator of resource usage data.
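For reference, a common way to install it is to apply the upstream manifest from the kubernetes-sigs project and then check that resource metrics are available (the URL below points at the latest upstream release; pin a specific version for production):
# install metrics-server from the upstream release manifest, then verify metrics are flowing
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
kubectl top nodes
kubectl top pods --all-namespaces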
The next step is to set up the HPA.
The Horizontal Pod Autoscaler automatically scales the number of pods in a replication controller, deployment or replica set based on observed CPU utilization or other custom metrics. HPA normally fetches metrics from a series of aggregated APIs:
metrics.k8s.io
custom.metrics.k8s.io
external.metrics.k8s.io
How to make it work?
HPA is supported by kubectl by default:
kubectl create - creates a new autoscaler
kubectl get hpa - lists your autoscalers
kubectl describe hpa - gets a detailed description of autoscalers
kubectl delete - deletes an autoscaler
Example:
kubectl autoscale rs foo --min=2 --max=5 --cpu-percent=80 creates an autoscaler for replica set foo, with target CPU utilization set to 80% and the number of replicas between 2 and 5. You can and should adjust all values to your needs.
Here is detailed documentation on how to use the kubectl autoscale command.
Please let me know if you find that useful.
I have a VM scale set that I want to set up auto-scaling for and I want to know how abrupt scaling down is. Before VMs get destroyed, I want to make sure any active long-running requests complete. Is this possible?
I am curious about the following:
How does auto-scaling decide which VMs to destroy when scaling down?
Is there any notification inside the VM that it is scheduled to be destroyed?
Can a VM that is scheduled to be destroyed control when it gets destroyed (and hold off destruction until all requests are complete)?
The VMs in my scale set will be behind a load balancer and I need to be able to drain connections (remove VMs from the backend pool) before destruction.
Autoscaling has several policies by which it selects which VMs to remove on scale-in; for example, "NewestVM" will remove the ones that launched last. You can read more here: https://learn.microsoft.com/en-us/azure/virtual-machine-scale-sets/virtual-machine-scale-sets-scale-in-policy
Regarding notification inside the VM about termination, there is a feature called "termination notification" that publishes a scheduled event you can read from the local instance metadata service, for example:
curl -s -H "Metadata:true" "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"
Read more here: https://azure.microsoft.com/en-us/blog/azure-virtual-machine-scale-sets-now-provide-simpler-management-during-scalein/
The VM can either wait for the termination timeout, or acknowledge the event via the metadata service (a POST request) to proceed with termination before the timeout.
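As a sketch of that early acknowledgement (the EventId value is a placeholder for the ID returned by the query above), the POST could look like:
# approve the pending Terminate event so deletion proceeds before the timeout; <event-id> is a placeholder
curl -H "Metadata:true" -X POST -d '{"StartRequests": [{"EventId": "<event-id>"}]}' "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"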
To drain connections, one method is to block the health probe IP address 168.63.129.16, so the VM becomes "unhealthy" in the load balancer or application gateway (depending on which you use); no new traffic will be sent to it while existing traffic remains active.
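A minimal sketch of that on a Linux instance, assuming iptables and assuming the health probe targets port 80 (both the tool and the port are assumptions; adjust to your probe configuration):
# drop load balancer health probes so this instance is marked unhealthy and drained; port 80 is an assumed probe port
iptables -I INPUT -p tcp -s 168.63.129.16 --dport 80 -j DROP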
How does auto-scaling decide which VMs to destroy when scaling down?
By default, auto-scaling will delete the VM with the highest instance ID (for example, if the instance IDs are 0, 2 and 3, VMSS will delete 3). We can use PowerShell to get the VMSS VMs' instance IDs.
PS C:\> Get-AzureRmVmssVM -ResourceGroupName "vmss" -VMScaleSetName "vmss"
ResourceGroupName Name Location Sku Capacity InstanceID ProvisioningState
----------------- ---- -------- --- -------- ---------- -----------------
VMSS vmss_0 westus Standard_D1_v2 0 Succeeded
VMSS vmss_2 westus Standard_D1_v2 2 Succeeded
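The AzureRM module has since been superseded by the Az module; assuming you have Az.Compute installed, the equivalent command is:
# Az module equivalent of the AzureRM cmdlet above
Get-AzVmssVM -ResourceGroupName "vmss" -VMScaleSetName "vmss"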
Is there any notification inside the VM that it is scheduled to be destroyed?
As far as I know, autoscale notifies the administrators and contributors of the resource by email; the VM itself will not receive a notification.
Can a VM that is scheduled to be destroyed control when it gets destroyed (and hold off destruction until all requests are complete)?
We can't hold off destruction until all requests are complete for now.
In most cases, we deploy VMSS with a load balancer that uses a round-robin approach; an instance stops receiving new requests once it is removed from the backend pool as part of deletion.
I want to make sure any active long-running requests complete. Is this possible?
As far as I know, we can choose different OS metrics for autoscale, but we can't guarantee that VMSS will delete VM instances only after long-running requests complete.