Prometheus Alert Manager: How do I prevent grouping in notifications - prometheus-alertmanager

I'm trying to set up Alert Manager in a simple configuration where it would send one Slack notification for each alert it receives.
I hoped to disable grouping by removing the group_by configuration.
The problem is that when I send 2 alerts one after the other, I get one Slack message for the first alert and then a second message in which the 2 alerts are grouped, even though the Alert Manager UI shows the 2 alerts as 'Not Grouped'.
Here is the config.yml
route:
  receiver: default-receiver
  group_wait: 1s # 30s
  group_interval: 1s # 5m
  # repeat_interval: 10m
  # group_by: [cluster, alertname]
receivers:
  - name: default-receiver
    slack_configs:
      - channel: "#alerts-test"
Any ideas?

From the Prometheus documentation on configuration:
You can use group_by: ['...'] in your Alert Manager configuration to effectively disable grouping.
Note that this was only introduced in v0.16. For more info, see this GitHub issue.
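For reference, a minimal sketch of a route using the special '...' value, which groups by all labels so every distinct alert ends up in its own group (the receiver name is just carried over from the question, and the short timers are only for testing):
route:
  receiver: default-receiver
  group_by: ['...']
  group_wait: 1s
  group_interval: 1s
With this in place each alert should arrive as its own Slack message, since no two distinct alerts share the exact same full label set.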

Related

Apache Pulsar Client - Broker notification of Closed consumer - how to resume data feed?

TL;DR: we use the Python client library to subscribe to a Pulsar topic. The logs show 'broker notification of consumer closed' when something happens server-side. The subscription appears to be re-established according to the logs, but we later find that the backlog was growing on the cluster because no messages were being delivered to our subscription.
We are running into an issue where an Apache Pulsar cluster we use, which is opaque to us and has a namespace defined where we publish/consume topics, is losing its connection with our consumer.
We have a python client consuming from a topic (with one Pulsar Client subscription per thread).
We have run into an issue where, due to an issue on the pulsar cluster, we see the following entry in our client logs:
"Broker notification of Closed consumer"
followed by:
"Created connection for pulsar://houpulsar05.mycompany.com:6650"
....for every thread in our agent.
Then we see the usual periodic log entries like this:
{"log":"2022-09-01 04:23:30.269 INFO [139640375858944] ConsumerStatsImpl:63 | Consumer [persistent://tenant/namespace/topicname, subscription-name, 0] , ConsumerStatsImpl (numBytesRecieved_ = 0, totalNumBytesRecieved_ = 6545742, receivedMsgMap_ = {}, ackedMsgMap_ = {}, totalReceivedMsgMap_ = {[Key: Ok, Value: 3294], }, totalAckedMsgMap_ = {[Key: {Result: Ok, ackType: 0}, Value: 3294], })\n","stream":"stdout","time":"2022-09-01T04:23:30.270009746Z"}
This gives the appearance that some connection has been re-established to some other broker.
However, no messages are being consumed. We have an alert on a Grafana dashboard which shows us the backlog on topics and subscriptions. Eventually it hits a count or rate threshold which alerts us that there is a problem. When we restart our agent, the subscription is re-established and the backlog can immediately be seen heading to 0.
Has anyone experienced such an issue?
Our code is typical:
import pulsar

# single broker address, as seen in the connection log above
client = pulsar.Client('pulsar://houpulsar05.mycompany.com:6650')
consumer = client.subscribe(
    topic='my-topic',
    subscription_name='my-subscription',
    consumer_type=my_consumer_type,
    consumer_name=my_agent_name
)
while True:
    msg = consumer.receive()   # blocks until a message arrives
    ex = msg.value()           # message payload
    # ... process and acknowledge the message
I haven't yet found a readily available docker-compose setup (or anything similar) to run a multi-cluster Pulsar installation locally on Docker Desktop, so that I could try killing off a broker and see how the consumer reacts.
Currently the Python client only supports configuring one broker's address and doesn't support retry for lookup yet. Here are two related PRs to support it:
https://github.com/apache/pulsar/pull/17162
https://github.com/apache/pulsar/pull/17410
Therefore, setting up a multi-node cluster might be no different from a standalone for this purpose.
If you only specified one broker in the service URL, you can simply test with a standalone. Run a consumer and a producer sending messages periodically, then restart the standalone. The "Broker notification of Closed consumer" message appears when the broker actively closes the connection, e.g. when your consumer has sent a SEEK command (via a seek call); the broker then disconnects the consumer and the log line appears.
By the way, it would help to include your Python client version, and a GitHub issue might be a better place to track this.
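If it helps to reproduce this locally, a minimal docker-compose sketch for running a single-node Pulsar standalone on Docker Desktop could look like the following (the image tag and port mappings are illustrative, not taken from the question):
version: '3'
services:
  pulsar:
    image: apachepulsar/pulsar:2.10.2
    command: bin/pulsar standalone
    ports:
      - "6650:6650"   # broker service port
      - "8080:8080"   # admin/HTTP port
Restarting this container while your consumer is attached is a rough stand-in for a broker going away.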

Why alertmanager not sending sms

I am using alert-manager and Prometheus to monitor my stack.
When I configure sending mails via alert-manager it works fine, but when I configure sending SMS via a webhook it fails.
I added the debug flag to alert-manager to verify it is getting the alerts, and indeed the alerts are arriving, but still no SMS are sent.
I also checked the webhook separately and it works perfectly.
Config file is:
global:
  resolve_timeout: '5m'
  smtp_smarthost: 'smtp.office365.com:587'
  smtp_from: 'no-reply@example.com'
  smtp_auth_username: 'no-reply@example.com'
  smtp_auth_password: 'xxxxx'
  smtp_require_tls: true
route:
  group_by: ['instance', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
  receiver: team-1
receivers:
  - name: 'team-1'
    webhook_configs:
      - send_resolved: true
        http_config: {}
        # max_alerts: 0
        url: 'https://rest.nexmo.com/sms/json?from=example&text=test&to=xxxxxxxxxxxx&api_key=xxxxxxx&api_secret=xxxxxxx'
    email_configs:
      - to: 'john.doe@example.com'
I tried putting only mails and it works; tried only SMS and it does not work.
What am I missing ?
Eventually, I wrote an SMS 'proxy' that accepts a simple URL GET invocation and internally calls Nexmo.
It did not work any other way, since Alertmanager's webhook_configs sends an HTTP POST with a JSON payload rather than a GET with query parameters, which is what the Nexmo REST URL above expects.
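For illustration, a sketch of what the receiver side can look like once such a proxy exists (the proxy host, port, and path are hypothetical, not from the question):
receivers:
  - name: 'team-1'
    webhook_configs:
      - send_resolved: true
        url: 'http://sms-proxy.internal:8080/alert'   # hypothetical proxy endpoint
    email_configs:
      - to: 'john.doe@example.com'
The proxy then parses the JSON body that Alertmanager POSTs and turns it into the GET request Nexmo expects.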

alertmanager filter by tag (timescale backend)

I am using alertmanager configured to read from a timescale db shared with other Prometheus/alertmanager systems.
I would like to set/check alerts only for services that include a specific tag, so I am wondering how I could configure Prometheus to apply alerts only for specific tags.
This is what I am currently using:
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
remote_write:
  - url: https://promscale.host:9201/write
remote_read:
  - url: https://promscale.host:9201/read
    read_recent: true
...
I found there is an option alert_relabel_configs, but its usage is unclear to me.
Any ideas?
FYI, alert_relabel_configs applies alert relabeling to alerts before they are sent to the Alertmanager.
To use alert_relabel_configs, below is an example that adds a new label when the relabel config matches:
alert_relabel_configs:
  - source_labels: [ log_level ]
    regex: warn
    target_label: severity
    replacement: warn
Note: The alerts are only changed when sent to Alertmanager; they are not changed in the Prometheus UI.
To test the relabel config online you can use https://relabeler.promlabs.com/
If you are using Prometheus Operator, configuring alert relabeling rules should be done in additionalAlertRelabelConfigs of PrometheusSpec; more details: https://github.com/prometheus-operator/prometheus-operator/issues/1805
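To address the original question of only forwarding alerts that carry a specific tag, a sketch using the keep action could look like this (the label name tag and its value are placeholders for whatever your services actually expose):
alert_relabel_configs:
  - source_labels: [ tag ]       # placeholder label name
    regex: my-team-tag           # placeholder value to keep
    action: keep
Alerts whose label set does not match the regex are dropped before they reach Alertmanager.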

Datadog Logs from Windows Event Viewer

I am new to DataDog and getting back into working with Windows Servers. I am trying to push Event Viewer logs (Security, System, etc.) to Datadog Logs. I have been successful in setting it up (using their documentation - https://docs.datadoghq.com/integrations/win32_event_log/), and I am getting logs into DD for that server from the System and Security channels:
logs:
  - type: windows_event
    channel_path: "System"
    source: "System"
    service: System_Event
  - type: windows_event
    channel_path: "Security"
    source: "Security"
    service: Security_Event
I know that you can push items from the Event Viewer to Events in DD by using Instances, and you can be more granular there. But I want that granularity in the Logs section, since we rarely view Events. Right now it is showing me all the items in the logs (success, etc.). I am looking to get only the Errors and Warnings piped to Logs.
Thanks for the help.
D
I came across the same problem and came up with the config below, which excludes "Information" events.
- type: windows_event
  channel_path: System
  source: System
  service: eventlog
  log_processing_rules:
    - type: exclude_at_match
      name: exclude_information_event
      pattern: ^.*[Ll]evel.*Information.*
Vincent
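As an alternative, Datadog's log_processing_rules also supports include_at_match, which could be used to keep only Errors and Warnings instead of excluding Information. A sketch (the pattern is a guess at how the level appears in the rendered event text, so verify it against your own logs):
- type: windows_event
  channel_path: System
  source: System
  service: eventlog
  log_processing_rules:
    - type: include_at_match
      name: include_error_and_warning_events
      pattern: ^.*[Ll]evel.*(Error|Warning).*
Only events whose text matches the pattern are forwarded; everything else is dropped at the agent.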

App Engine Flex deployment health check fails

I've made a Python 3 Flask app to serve as an API proxy with gunicorn. I've deployed the OpenAPI spec to Cloud Endpoints and filled in the endpoints service in the app.yaml file.
When I try to deploy to App Engine Flex, the health check fails because it takes too long. I've tried to alter the readiness_check's app_start_timeout_sec as suggested, but to no avail. When checking the logs on Stackdriver I can only see gunicorn booting a couple of workers and eventually terminating everything, a couple of times in a row, with no further explanation of what goes wrong. I've also tried specifying resources in the app.yaml and scaling the workers in the gunicorn.conf.py file, again without success.
Then I tried switching to uWSGI, but it behaved the same way: starting up and terminating a couple of times in a row, followed by the health check timeout.
error:
ERROR: (gcloud.app.deploy) Error Response: [4] Your deployment has failed to become healthy in the allotted time and therefore was rolled back. If you believe this was an error, try adjusting the 'app_start_timeout_sec' setting in the 'readiness_check' section.
app.yaml
runtime: python
env: flex
entrypoint: gunicorn -c gunicorn.conf.py -b :$PORT main:app
runtime_config:
  python_version: 3
endpoints_api_service:
  name: 2019-09-27r0
  rollout_strategy: managed
resources:
  cpu: 1
  memory_gb: 2
  disk_size_gb: 10
gunicorn.conf.py:
import multiprocessing
bind = "127.0.0.1:8000"
workers = multiprocessing.cpu_count() * 2 + 1
requirements.txt:
aniso8601==8.0.0
certifi==2019.9.11
chardet==3.0.4
Click==7.0
Flask==1.1.1
Flask-Jsonpify==1.5.0
Flask-RESTful==0.3.7
gunicorn==19.9.0
idna==2.8
itsdangerous==1.1.0
Jinja2==2.10.1
MarkupSafe==1.1.1
pytz==2019.2
requests==2.22.0
six==1.12.0
urllib3==1.25.5
Werkzeug==0.16.0
pyyaml==5.1.2
Is there anyone who can spot a conflict or something I forgot here? I'm out of ideas and really need help. It would also definitely help if someone could point me in the right direction to find more info in the logs (I also ran gcloud app deploy with --verbosity=debug, but this only shows "Updating service [default]... ...Waiting to retry."). I would really like to know what causes the health checks to time out!
Thanks in advance!
You can either disable health checks or customize them.
To disable them, add the following to your app.yaml:
health_check:
  enable_health_check: False
To customize them, take a look at the split health checks.
You can customize the liveness check request by adding an optional liveness_check section to your app.yaml file, for example:
liveness_check:
  path: "/liveness_check"
  check_interval_sec: 30
  timeout_sec: 4
  failure_threshold: 2
  success_threshold: 2
In the documentation you can check the settings available for liveness checks.
In addition, there are the Readiness checks. In the same way, you can customize some settings, for example:
readiness_check:
  path: "/readiness_check"
  check_interval_sec: 5
  timeout_sec: 4
  failure_threshold: 2
  success_threshold: 2
  app_start_timeout_sec: 300
The values mentioned above can be changed according to your needs. Check these values carefully, especially since App Engine Flexible takes some minutes to get an instance started up; this is a notable difference from App Engine Standard and should not be taken lightly.
If you examine the nginx.health_check logs for your application, you might see health check polling happening more frequently than you have configured, due to the redundant health checkers that are also following your settings. These redundant health checkers are created automatically and you cannot configure them.
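Since the deployment error specifically points at app_start_timeout_sec, one low-effort thing to try is simply raising that one setting well above the default (the 900 here is an arbitrary example value, not a recommendation from the docs):
readiness_check:
  path: "/readiness_check"
  app_start_timeout_sec: 900
If the deployment then succeeds, the app was healthy but just slow to start; if it still fails, the problem is more likely that the app never becomes healthy at all (for example, gunicorn crashing on startup, as the logs in the question suggest).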
