Minimal viable alertmanager configuration without paging - prometheus-alertmanager

I'm trying to get Alertmanager running without paging anyone. All I want is to see the alerts on the Alertmanager web page, but I can't find a config that doesn't try to send notifications somewhere.
If I omit the config file I get an error:
component=configuration msg="Loading configuration file failed"
file=alertmanager.yml err="open alertmanager.yml: no such file or
directory"
I have commented out that file and the associated command but obviously it's mandatory. Can somebody point me to a config that works but doesn't send alert notifications anywhere?

I think super-minimal should be something like:
global:
  resolve_timeout: 5m
route:
  receiver: "null"
  group_by:
    - job
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
receivers:
  - name: "null"
Less minimal:
global:
  resolve_timeout: 5m
route:
  receiver: main
  group_by:
    - job
  routes:
    - receiver: "null"
      match:
        alertname: Watchdog
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
receivers:
  - name: "null"
  - name: main
    # Here define the receiver
# Add templates if necessary
#templates:
#  - /etc/alertmanager/*.tmpl
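To sanity-check either file before starting Alertmanager, it can be validated with amtool, which ships alongside Alertmanager (assuming the file is saved as alertmanager.yml):
amtool check-config alertmanager.yml
With the "null" receiver in place, alerts still show up in the Alertmanager UI; they simply aren't routed to any notification integration.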

Related

Messages not picked up by queue after KEDA scaling in AKS

I have deployed an AKS cluster on which I also installed KEDA. This KEDA scaler will schedule a pod for me when a new message appears in an Azure Storage Queue. Each message is processed by one pod, and each pod (except for system pods) will be scheduled on a new node because it is very resource-intensive.
The application in the pod is essentially just an Azure Function written in Node. I found it easier to do it this way because I don't have to use any SDKs and can just use bindings.
Now KEDA successfully launches one scaled job per message, but when the pod(s) start running, the Azure Functions runtime logs say that it started, but not that it was triggered by the message in the queue. If I check the queue, I also don't see the messages being picked up; they're still there.
I also checked the storage connection string environment variable, and it's correct.
I found two places where I can make configuration changes that might help with this: the host.json in the Azure Function and the deployment.yml which I applied for the ScaledJob.
In the host.json I've set batchSize to 1 and batchThreshold to 0.
Below is the YAML I apply (the indentation is correct in the actual file; it applies fine):
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: places-job
  namespace: places
spec:
  jobTargetRef:
    parallelism: 10
    completions: 1
    backoffLimit: 3
    template:
      metadata:
        namespace: places
        labels:
          app: places
      spec:
        affinity:
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - places
                topologyKey: kubernetes.io/hostname
        containers:
          - name: places
            image: {azure-container-registry}:latest
            env:
              - name: AzureFunctionsJobHost__functions__0
                value: places
            envFrom:
              - secretRef:
                  name: places
            readinessProbe:
              failureThreshold: 3
              periodSeconds: 10
              successThreshold: 1
              timeoutSeconds: 240
              httpGet:
                path: /
                port: 80
                scheme: HTTP
            startupProbe:
              failureThreshold: 3
              periodSeconds: 10
              successThreshold: 1
              timeoutSeconds: 240
              httpGet:
                path: /
                port: 80
                scheme: HTTP
            resources:
              limits:
                cpu: 4
                memory: 10G
        restartPolicy: Never
  pollingInterval: 10 # Optional. Default: 30 seconds
  successfulJobsHistoryLimit: 1 # Optional. Default: 100. How many completed jobs should be kept.
  failedJobsHistoryLimit: 1 # Optional. Default: 100. How many failed jobs should be kept.
  triggers:
    - type: azure-queue
      metadata:
        queueName: places-requests
        queueLength: "1"
        activationQueueLength: "0"
        connectionFromEnv: AzureWebJobsStorage
        accountName: {storage-account-name}
Does anyone know any other setting I might have to tweak so that the newly deployed pods/nodes do pick up the messages from the queue?
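One generic check that can help while debugging this: describing the ScaledJob shows the scaler's conditions and recent events (using the names from the manifest above):
kubectl describe scaledjob places-job -n places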

Grafana Loki with promtail no json parsing

I have two Grafana Loki installations, both done with Helm from the official repository.
Both are configured exactly the same (except for DNS).
The only difference is that one is on Azure and one is on our own ESXi.
The problem I have is log file parsing. The installation on Azure always seems to parse the log files with the - cri {} stage and not with - docker {}.
A quick look inside the promtail pods shows the - docker {} setting in promtail.yaml, but I always get this output:
2023-01-16 10:39:15
2023-01-16T09:39:15.604384089Z stdout F {"level":50,"time":1673861955603,"service
On our own ESXi I get the correct output:
2023-01-13 16:58:18
{"level":50,"time":1673625498068,"service"
From what I read, the stdout F prefix comes from - cri {} parsing, which is promtail's default.
Any idea why this happens? My installation YAML is:
#helm upgrade --install loki --namespace=monitoring grafana/loki-stack -f value_mus.yaml
grafana:
  enabled: true
  admin:
    existingSecret: grafana-admin-credentials
  sidecar:
    datasources:
      enabled: true
      maxLines: 1000
  image:
    tag: latest
  persistence:
    enabled: true
    size: 10Gi
    storageClassName: managed-premium
    accessModes:
      - ReadWriteOnce
  grafana.ini:
    users:
      default_theme: light
    server:
      domain: xxx
    smtp:
      enabled: true
      from_address: xxx
      from_name: Grafana Notification
      host: xxx
      user: xxx
      password: xxx
      skip_verify: false
      startTLS_policy:
promtail:
  enabled: true
  config:
    snippets:
      pipelineStages:
        - docker: {}
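To double-check which pipeline stage the running pods actually use, the rendered config can be dumped from one of the promtail pods (the config path below is the chart default and may differ in other setups; <promtail-pod> is a placeholder for the actual pod name):
kubectl -n monitoring exec <promtail-pod> -- cat /etc/promtail/promtail.yaml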
Any help will be welcome.

VictoriaMetrics - pass filters in azure_sd_config like ec2_sd_config

I have to make this work on the Azure platform. The scrape_config solution for vmagent was working fine with AWS, but I can't find a similar solution for Azure. In this particular snippet we have configured scraping of node_exporter from VMs that have the tag key mon_exporters with value node. I checked the official documentation https://docs.victoriametrics.com/sd_configs.html#azure_sd_configs but couldn't find any mention of a filter option.
Is there any way I can filter the VMs based on my needs? Right now it fetches all the VMs in that particular subscription.
- job_name: 'node_exporter'
  honor_timestamps: true
  scrape_interval: 1m
  scrape_timeout: 15s
  metrics_path: /metrics
  scheme: http
  follow_redirects: true
  azure_sd_configs:
    - subscription_id: 'xxxxx'
      authentication_method: 'ManagedIdentity'
      environment: 'AzurePublicCloud'
      refresh_interval: 5m
      port: 9100
      filters:
        - name: 'tag:mon_exporters'
          values: ["*node*"]
azure_sd_config in VictoriaMetrics doesn't support the filters option, but you can keep the needed targets with an action: keep relabeling rule on the __meta_azure_machine_tag_mon_exporters label. Try the following config:
- job_name: 'node_exporter'
  scrape_interval: 1m
  azure_sd_configs:
    - subscription_id: 'xxxxx'
      authentication_method: 'ManagedIdentity'
      port: 9100
  relabel_configs:
    - action: keep
      if: '{__meta_azure_machine_tag_mon_exporters="node"}'
See more details about this type of relabeling here
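For reference, the if matcher above is a VictoriaMetrics-specific extension; if the config ever needs to stay compatible with vanilla Prometheus, the same keep rule can be written with standard relabeling syntax (a sketch assuming the tag value is exactly node):
relabel_configs:
  - action: keep
    source_labels: [__meta_azure_machine_tag_mon_exporters]
    regex: node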

Kind (Kubernetes) cluster throwing ImagePullBackOff error

I need to pull an image from the public Docker repository, i.e. hello-world:latest, and run it on Kubernetes. I created the cluster using kind. I ran the image using the command below:
kubectl run test-pod --image=hello-world
Then I did
kubectl describe pods
to get the status of the pods. It showed an ImagePullBackOff error. It seems there is some network issue when pulling the image in the kind cluster, although I am able to pull the image with Docker directly.
I have searched the whole internet regarding this issue but nothing worked. Following is my pod specification:
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2022-05-16T15:01:17Z"
  labels:
    run: test-pod
  name: test-pod
  namespace: default
  resourceVersion: "4370"
  uid: 6ef121e2-805b-4022-9a13-c17c031aea88
spec:
  containers:
  - image: hello-world
    imagePullPolicy: Always
    name: test-pod
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-jjsmp
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: kind-control-plane
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: kube-api-access-jjsmp
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2022-05-16T15:01:17Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2022-05-16T15:01:17Z"
    message: 'containers with unready status: [test-pod]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  containerStatuses:
  - image: hello-world
    imageID: ""
    lastState: {}
    name: test-pod
    ready: false
    restartCount: 0
    started: false
    state:
      waiting:
        message: Back-off pulling image "hello-world"
        reason: ImagePullBackOff
  hostIP: 172.18.0.2
  phase: Pending
  podIP: 10.244.0.5
  podIPs:
  - ip: 10.244.0.5
  qosClass: BestEffort
  startTime: "2022-05-16T15:01:17Z"
The ImagePullBackOff error means that Kubernetes couldn't pull the image from the registry and will keep retrying, with a back-off delay that grows up to a compiled-in limit of 300 seconds (5 minutes) between attempts. This issue could happen because Kubernetes is facing one of the following conditions:
You have exceeded the rate or download limit on the registry.
The image registry requires authentication.
There is a typo in the image name or tag.
The image or tag does not exist.
You can start by reviewing whether you can pull the image locally, or try jumping onto the node via SSH and running docker pull to fetch the image directly.
If you still can't pull the image, another option is to add 8.8.8.8 to /etc/resolv.conf.
Update:
To avoid exposing your kind cluster to the internet, try to pull the image locally on your PC by manually specifying a path to a different registry.
Sample:
docker pull myregistry.local:5000/testing/test-image
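With kind specifically, another common option is to pull the image on the host and side-load it into the cluster nodes, so no pull has to happen from inside the cluster at all (assuming the default cluster name kind):
docker pull hello-world:latest
kind load docker-image hello-world:latest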

How to monitor Fastify app with Prometheus and Grafana?

I am learning to monitor my Fastify app with Prometheus and Grafana. First, I installed the fastify-metrics package and registered it in the Fastify app.
// app.ts
import metrics from 'fastify-metrics'
...
app.register(metrics, {
  endpoint: '/metrics',
})
Then I set up Prometheus and Grafana in docker-compose.yml:
version: "3.7"
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - prometheus_data:/prometheus
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
    network_mode: host
    ports:
      - '9090:9090'
  grafana:
    image: grafana/grafana:latest
    volumes:
      - grafana_data:/var/lib/grafana
      # - ./grafana/provisioning:/etc/grafana/provisioning
      # - ./grafana/config.ini:/etc/grafana/config.ini
      # - ./grafana/dashboards:/var/lib/grafana/dashboards
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=ohno
    depends_on:
      - prometheus
    network_mode: host
    ports:
      - '3000:3000'
volumes:
  prometheus_data: {}
  grafana_data: {}
I added network_mode: host because the Fastify app will be running at localhost:8081.
Here's the Prometheus config:
# prometheus.yml
global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 1m
scrape_configs:
  - job_name: 'prometheus'
    # metrics_path: /metrics
    static_configs:
      - targets: ['app:8081']
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:8081']
After docker-compose up and npm run dev, the Fastify app is up and running, and the target localhost:8081 shows as UP in the Prometheus dashboard at localhost:9090, where I tried executing some metrics queries.
I imported the Node Exporter Full and Node Exporter Server Metrics dashboards, added a Prometheus data source at localhost:9090 named Fastify, and saved it successfully; it showed "Data source is working".
However, when I go to the Node Exporter Full dashboard, it shows no data. I selected Fastify as the data source, but the other selections at the upper-left corner show None.
Please help, what am I doing wrong?
It looks like you're using a dashboard intended for Linux host stats. In order to use Prometheus/Grafana with your Fastify app, you'll need a dashboard that's meant for Node.js apps. For example:
https://grafana.com/grafana/dashboards/11159
https://grafana.com/grafana/dashboards/12230
Plugging one of those in should do the trick.
You should specify the metrics_path in the job to match the endpoint defined in your fastify-metrics registration, and also update the targets param:
- job_name: 'node_exporter'
  scrape_interval: 5s
  metrics_path: /metrics
  scheme: http
  static_configs:
    - targets: ['localhost:8081']
      labels:
        group: 'node_exporter'
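Before digging further into Grafana, it is also worth confirming that the app really exposes metrics by hitting the endpoint directly (assuming the app listens on port 8081 as described above):
curl http://localhost:8081/metrics
If that returns Prometheus-format text, the remaining issue is on the Prometheus/Grafana side (scrape config or dashboard queries) rather than in the Fastify app.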
