How to modify the alert manager's threshold value - prometheus-alertmanager

I use node_exporter to monitor CPU usage and receive notifications when it crosses a threshold.
The example below is a rule that fires when CPU usage is more than 80%.
alert: HostHighCpuLoad
expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
for: 0m
labels:
  severity: warning
annotations:
  summary: Host high CPU load (instance {{ $labels.instance }})
  description: "CPU load is > 80%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
I need to receive a threshold from a web UI and change the threshold set in the rule file, e.g. from 80% to 70%.
Can I turn the threshold into a variable, like this?
expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > {THRESHOLD_VALUE}
Or should I write a script that rewrites the 80 in the rule file to 70?
Please give me advice.
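In case it helps, here is a minimal sketch of the templating/script approach, assuming a template file such as cpu_alert.rules.tmpl (the file name and the CPU_THRESHOLD placeholder are illustrative, not part of your setup):

alert: HostHighCpuLoad
expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > ${CPU_THRESHOLD}
for: 0m
labels:
  severity: warning
annotations:
  summary: Host high CPU load (instance {{ $labels.instance }})
  description: "CPU load is above the configured threshold\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

A small script called from the web UI could substitute the value (for example envsubst '${CPU_THRESHOLD}' so the {{ $value }} / {{ $labels }} templates are left untouched, or sed), validate the result with promtool check rules, write it to the rules directory, and then ask Prometheus to reload (a SIGHUP, or an HTTP POST to /-/reload if --web.enable-lifecycle is enabled). Prometheus rule files themselves do not support variables, so some form of external templating or rewriting is needed either way.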

Related

snakemake allocates memory twice

I am noticing that all my rules request memory twice: once at a lower maximum than what I requested (mem_mb) and once at what I actually requested (mem_gb). If I run the rules as localrules they do run faster. How can I make sure the default settings do not interfere?
resources: mem_mb=100, disk_mb=8620, tmpdir=/tmp/pop071.54835, partition=h24, qos=normal, mem_gb=100, time=120:00:00
The rules are as follows:
rule bwa_mem2_mem:
    input:
        R1 = "data/results/qc/{species}.{population}.{individual}_1.fq.gz",
        R2 = "data/results/qc/{species}.{population}.{individual}_2.fq.gz",
        R1_unp = "data/results/qc/{species}.{population}.{individual}_1_unp.fq.gz",
        R2_unp = "data/results/qc/{species}.{population}.{individual}_2_unp.fq.gz",
        idx = "data/results/genome/genome",
        ref = "data/results/genome/genome.fa"
    output:
        bam = "data/results/mapped_reads/{species}.{population}.{individual}.bam",
    log:
        bwa = "logs/bwa_mem2/{species}.{population}.{individual}.log",
        sam = "logs/samtools_view/{species}.{population}.{individual}.log",
    benchmark:
        "benchmark/bwa_mem2_mem/{species}.{population}.{individual}.tsv",
    resources:
        time = parameters["bwa_mem2"]["time"],
        mem_gb = parameters["bwa_mem2"]["mem_gb"],
    params:
        extra = parameters["bwa_mem2"]["extra"],
        tag = compose_rg_tag,
    threads:
        parameters["bwa_mem2"]["threads"],
    shell:
        "bwa-mem2 mem -t {threads} -R '{params.tag}' {params.extra} {input.idx} {input.R1} {input.R2} | "
        "samtools sort -l 9 -o {output.bam} --reference {input.ref} --output-fmt CRAM -# {threads} /dev/stdin 2> {log.sam}"
and the config is:
cluster:
  mkdir -p logs/{rule} &&  # change the log file to logs/slurm/{rule}
  sbatch
    --partition={resources.partition}
    --time={resources.time}
    --qos={resources.qos}
    --cpus-per-task={threads}
    --mem={resources.mem_gb}
    --job-name=smk-{rule}-{wildcards}
    --output=logs/{rule}/{rule}-{wildcards}-%j.out
    --parsable  # Required to pass job IDs to scancel
default-resources:
  - partition=h24
  - qos=normal
  - mem_gb=100
  - time="04:00:00"
restart-times: 3
max-jobs-per-second: 10
max-status-checks-per-second: 1
local-cores: 1
latency-wait: 60
jobs: 100
keep-going: True
rerun-incomplete: True
printshellcmds: True
scheduler: greedy
use-conda: True  # Required to run with a local conda environment
cluster-status: status-sacct.sh  # Required to monitor the status of the submitted jobs
cluster-cancel: scancel  # Required to cancel the jobs with Ctrl + C
cluster-cancel-nargs: 50
Cheers,
Angel
Right now there are two separate memory resource requirements:
mem_mb
mem_gb
From Snakemake's perspective these are different resources, so both will be passed to the cluster. A quick fix is to use the same units everywhere; e.g., if the jobs really require only 100 MB, the default resources should be changed to:
default-resources:
- partition=h24
- qos=normal
- mem_mb=100
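For the units to line up end to end, the rule and the sbatch call also have to agree on a single resource name. A sketch of the other two places that would change, assuming the values in parameters are given in gigabytes (the * 1024 conversion is only illustrative):

# In the rule: request mem_mb instead of mem_gb
    resources:
        time = parameters["bwa_mem2"]["time"],
        mem_mb = parameters["bwa_mem2"]["mem_gb"] * 1024,

# In the profile: submit with the same resource name
cluster:
  sbatch
    --mem={resources.mem_mb}

The point is that only one memory resource name should exist across the rule, default-resources and the sbatch line, so Snakemake does not pass two different memory requests to the scheduler.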

Is there any way to reduce this Prometheus alert expression code? I have multiple similar expressions where only the source instance differs

Suppose I am getting metrics from a service in the event_processing_bucket metric,
where the instances are labelled source=ONE, source=TWO, source=THREE ...... TEN.
Currently I am using the following approach to get the alerts, but I have written a separate expression just because I have to get data for every single source.
Is there any way to reduce this duplicate code, so that I could write only one alert rule and it would alert for each source separately based on its respective value?
Here are the prometheus alert expressions,
- alert: ONE_SLA_GREATER_THAN_5DAYS
  expr: sum(rate(event_processing_bucket{source="ONE"}[1m])) > 5
  for: 1m
  labels:
    severity: warning
    team: mySlackChannel
  annotations:
    description: ONE_SLA is GREATER_THAN_5DAYS
    summary: ONE_SLA is GREATER_THAN_5DAYS
- alert: TWO_SLA_GREATER_THAN_5DAYS
  expr: sum(rate(event_processing_bucket{source="TWO"}[1m])) > 5
  for: 1m
  labels:
    severity: warning
    team: mySlackChannel
  annotations:
    description: TWO_SLA is GREATER_THAN_5DAYS
    summary: TWO_SLA is GREATER_THAN_5DAYS
.
.
.
- alert: TEN_SLA_GREATER_THAN_5DAYS
  expr: sum(rate(event_processing_bucket{source="TEN"}[1m])) > 5
  for: 1m
  labels:
    severity: warning
    team: mySlackChannel
  annotations:
    description: TEN_SLA is GREATER_THAN_5DAYS
    summary: TEN_SLA is GREATER_THAN_5DAYS
Please guide me on how to write this as a single expression if possible; if not, please say so.
Thanks in advance!
One way is to group by source:
histogram_quantile(0.95, sum(increase(event_bucket[5m])) by (le, source)) > 5
The per-source result values can then be used to trigger that many separate alerts.
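Applied to the rules in the question, a single rule grouped by source could look something like the sketch below (keeping the question's metric and threshold; the alert name is illustrative). Because the expression preserves the source label, each source that crosses the threshold produces its own alert:

- alert: SLA_GREATER_THAN_5DAYS
  expr: sum by (source) (rate(event_processing_bucket[1m])) > 5
  for: 1m
  labels:
    severity: warning
    team: mySlackChannel
  annotations:
    description: '{{ $labels.source }}_SLA is GREATER_THAN_5DAYS'
    summary: '{{ $labels.source }}_SLA is GREATER_THAN_5DAYS'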

Appending a key to the top of an array

I have some hiera not unlike the following (I know this is invalid hiera with two keys... bear with me):
an::example::rule_files:
  my_rules:
    groups:
      - name: my_rules
        rules:
          - alert: highCPU
            expr: CPU > 90
            for: 5m
            annotations:
              summary: "CPU is too high"
              description: "CPU should be less than 90"
  someone_elses_rules:
    groups:
      - name: someone_elses_rules
        rules:
          - alert: highCPU
            expr: CPU > 70
            for: 5m
            annotations:
              summary: "CPU is too high"
              description: "CPU should be less than 70 on someone else's system"
I'm trying to turn this into a YAML file (the key is the filename). Now I know this is invalid hiera, and I can remove the groups key to get this working (which is exactly what I've done); however, when I try to reinsert it into the array, I can't get the formatting right. Here's the Puppet code I'm using:
$alert_files = hiera('an::example::rule_files'),
$alert_files.each | String $alerts_file_name, Array $alert_config_pre | {
  $prefix = [ "groups:" ]
  $alert_config = $prefix + $alert_config_pre
  file { "/etc/prometheus/${alerts_file_name}.rules":
    ensure  => file,
    content => $alert_config.to_yaml,
  }
}
Here's what I want:
cat /etc/prometheus/my_rules.rules
---
groups:
- name: my_rules
  rules:
  - alert: highCPU
    expr: CPU > 90
    for: 5m
    annotations:
      summary: CPU is too high
      description: CPU should be less than 90
and here's what I get:
---
- 'groups:'
- name: my_rules
  rules:
  - alert: highCPU
    expr: CPU > 90
    for: 5m
    annotations:
      summary: CPU is too high
      description: CPU should be less than 90
Any help would be massively appreciated. I feel like this should be simple but I've not really made any progress (I can't even remove the quotes from the word groups). If this is possible in either hiera or puppet (perhaps I've defined the hiera wrong) then great; any progress I can make in any way will be really appreciated.
This ...
$alert_files = hiera('an::example::rule_files'),
$alert_files.each | String $alerts_file_name, Array $alert_config_pre | {
... depends on the data associated with key an::example::rule_files being a Hash with String keys and Array values. In the YAML presented at the top of the question, that item is instead a hash with String keys and Hash values. Inasmuch as the data seem to match the wanted file content, the problem seems to be not with the YAML (except for the inconsistent indentation), but rather with the Puppet code.
To work as you appear to want with the data you want, the Puppet code might look more like so:
$alert_files = lookup('an::example::rule_files'),
$alert_files.each |String $alerts_file_name, Hash $alert_config| {
  file { "/etc/prometheus/${alerts_file_name}.rules":
    ensure  => 'file',
    content => $alert_config.to_yaml,
  }
}
Note that I have switched from the deprecated hiera() function to its replacement, lookup().
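If you would rather keep the original Array-valued hiera data and re-attach the key in Puppet, wrapping the array in a Hash (instead of prepending the string "groups:") would also produce the wanted YAML. A sketch under that assumption:

$alert_files = lookup('an::example::rule_files')
$alert_files.each |String $alerts_file_name, Array $alert_config_pre| {
  # Put the array under a real 'groups' key so to_yaml emits a mapping, not a list
  $alert_config = { 'groups' => $alert_config_pre }
  file { "/etc/prometheus/${alerts_file_name}.rules":
    ensure  => file,
    content => $alert_config.to_yaml,
  }
}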

How to calculate CPU Utilization in Prometheus?

I'm new to Prometheus and I got confused about CPU usage metrics.
Here are my two cents about CPU usage. (Please correct me if I'm wrong.)
container_cpu_usage_seconds_total = container_cpu_user_seconds_total + container_cpu_system_seconds_total
"cores" = container_spec_cpu_quota / container_spec_cpu_period
This is the CPU time a pod used per second:
sum by (pod, namespace) (rate(container_cpu_usage_seconds_total{image!=""}[1m]))
This is the CPU time a pod is allowed per second:
sum by (pod, namespace) (container_spec_cpu_quota{image!=""} / 100000)
So the utilization is:
sum by (pod, namespace) (rate(container_cpu_usage_seconds_total{image!=""}[1m])) / sum by (pod, namespace) (container_spec_cpu_quota{image!=""} / 100000) * 100
But what about pods that don't have spec.cpu.limits? I presume:
sum by (pod, namespace) (rate(container_cpu_usage_seconds_total{image!=""}[1m])) / CORES_ON_NODE * 100
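For pods without a CPU limit, one option is to divide by a core count instead of the quota. If cAdvisor's machine_cpu_cores metric is being scraped, a rough cluster-level sketch (it sums cores across all nodes, so it is not a strict per-node figure) could look like:

sum by (pod, namespace) (rate(container_cpu_usage_seconds_total{image!=""}[1m]))
  / scalar(sum(machine_cpu_cores)) * 100

Joining each pod to the core count of the specific node it runs on needs an extra label join (e.g. via kube_pod_info from kube-state-metrics), so treat the expression above only as an approximation.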

Artillery.io: How to generate test report for each Scenario?

Artillery: How do I run the scenarios sequentially and also display the results of each scenario in the same file?
I'm currently writing a Node.js test with Artillery.io to compare the performance of two endpoints that I implemented. I defined two scenarios and I would like to get the result of each one in the same report file.
The execution of the tests is not sequential, which means that at the end of the test I have an already-combined result and no way to know the performance of each scenario, only of all of them together.
config:
  target: "http://localhost:8080/api/v1"
  plugins:
    expect: {}
    metrics-by-endpoint: {}
  phases:
    - duration: 60
      arrivalRate: 2
  environments:
    dev:
      target: "https://backend.com/api/v1"
      phases:
        - duration: 60
          arrivalRate: 2
scenarios:
  - name: "Nashhorn"
    flow:
      - post:
          url: "/casting/nashhorn"
          auth:
            user: user1
            pass: user1
          json:
            body:
              fromFile: "./casting-dataset-01-as-input.json"
              options:
                filename: "casting_dataset"
                conentType: "application/json"
          expect:
            statusCode: 200
          capture:
            regexp: '[^]*'
            as: 'result'
      - log: 'result= {{result}}'
  - name: "Nodejs"
    flow:
      - post:
          url: "/casting/nodejs"
          auth:
            user: user1
            pass: user1
          json:
            body:
              fromFile: "./casting-dataset-01-as-input.json"
              options:
                filename: "casting_dataset"
                conentType: "application/json"
          expect:
            statusCode: 200
          capture:
            regexp: '[^]*'
            as: 'result'
      - log: 'result= {{result}}'
How to run the scenarios sequentially and also display the results of each scenario in the same file?
Thank you in advance for your answers
I think you are missing the weight parameter, which defines the probability of executing each scenario. If you put a weight of 1 in your first scenario and the same value in the second, both will have the same probability of being executed (50%).
If you give the first scenario a weight of 3 and the second one a weight of 1, the second scenario will have a 25% probability of execution while the first one will have a 75% probability of being executed.
Combined with the arrivalRate parameter and setting rampTo to 2, this will cause 2 scenarios to be executed every second; if you set a weight of 1 on both scenarios, they will be executed at the same time.
Look for scenario weights in the documentation:
scenarios:
  - flow:
      - log: Scenario for GET requests
      - get:
          url: /v1/url_test_1
    name: Scenario for GET requests
    weight: 1
  - flow:
      - log: Scenario for POST requests
      - post:
          json: {}
          url: /v1/url_test_2
    name: Scenario for POST
    weight: 1
I hope this helps you.
To my knowledge, there isn't a good way to do this with the existing Artillery logic.
Using this test script:
scenarios:
  - name: "test 1"
    flow:
      - post:
          url: "/postman-echo.com/get?test=123"
    weight: 1
  - name: "test 2"
    flow:
      - post:
          url: "/postman-echo.com/get?test=123"
    weight: 1
... etc...
Started phase 0 (equal weight), duration: 1s # 13:21:54(-0500) 2021-01-06
Report # 13:21:55(-0500) 2021-01-06
Elapsed time: 1 second
Scenarios launched: 20
Scenarios completed: 20
Requests completed: 20
Mean response/sec: 14.18
Response time (msec):
min: 117.2
max: 146.1
median: 128.6
p95: 144.5
p99: 146.1
Codes:
404: 20
All virtual users finished
Summary report # 13:21:55(-0500) 2021-01-06
Scenarios launched: 20
Scenarios completed: 20
Requests completed: 20
Mean response/sec: 14.18
Response time (msec):
min: 117.2
max: 146.1
median: 128.6
p95: 144.5
p99: 146.1
Scenario counts:
test 7: 4 (20%)
test 5: 2 (10%)
test 3: 1 (5%)
test 1: 4 (20%)
test 9: 2 (10%)
test 8: 3 (15%)
test 10: 2 (10%)
test 4: 1 (5%)
test 6: 1 (5%)
Codes:
404: 20
So basically you can see that the scenarios are weighted equally but are not running equally. I think something would need to be added to Artillery's code itself to support this. Happy to be wrong here.
You can use the per endpoint metrics plugin to give you the results per endpoint instead of aggregated.
https://artillery.io/docs/guides/plugins/plugin-metrics-by-endpoint.html
I see you already have this in your config, but it cannot be working if it is not giving you what you need. Did you install the plugin as well as adding it to the config?
npm install artillery-plugin-metrics-by-endpoint
In terms of running sequentially, I'm not sure why you would want to, but assuming you do, you just need to define each POST as part of the same scenario instead of two different scenarios. That way the second step will only execute after the first step has responded. I believe the plugin reports per endpoint, not per scenario, so it will still give you the report you want; a sketch follows.
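For what it is worth, a minimal sketch of that single combined scenario, based on the config in the question (the scenario name is illustrative; auth, expect and capture options omitted for brevity):

scenarios:
  - name: "Nashhorn then Nodejs"
    flow:
      - post:
          url: "/casting/nashhorn"
          json:
            body:
              fromFile: "./casting-dataset-01-as-input.json"
      - post:
          url: "/casting/nodejs"
          json:
            body:
              fromFile: "./casting-dataset-01-as-input.json"

With metrics-by-endpoint enabled, the report still breaks latencies down per URL, so each endpoint remains visible even though both requests now run inside one scenario.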
