GCE alerting when one of the created metrics is absent (via Terraform) - terraform-provider-gcp

I have configured alert policies via Terraform, covering CPU/memory and many other alerts. Unfortunately, I ran into an issue when one of my GCE instances became unresponsive: I received a flood of alerts in Slack, because I have configured a condition_absent block for every policy.
For example:
condition_absent {
  duration = "360s"
  filter   = "metric.type=\"custom.googleapis.com/quota/gce\" resource.type=\"global\""

  aggregations {
    alignment_period     = "60s"
    cross_series_reducer = "REDUCE_SUM"
    group_by_fields = [
      "metric.label.metric",
      "metric.label.region",
    ]
    per_series_aligner = "ALIGN_MEAN"
  }
}

condition_absent {
  duration = "360s"
  filter   = "metric.type=\"agent.googleapis.com/memory/percent_used\" resource.type=\"gce_instance\" metric.label.\"state\"=\"used\""

  aggregations {
    alignment_period     = "60s"
    cross_series_reducer = "REDUCE_SUM"
    per_series_aligner   = "ALIGN_MEAN"
  }
}
My question is: can I create one condition_absent block in Terraform instead of many, and send a single notification instead of dozens when one of the metrics stops reporting?

I have resolved this by adding an alert on the Monitoring Agent Uptime metric. It correctly shows when the VM is inaccessible (under overload, etc.).
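For reference, a minimal sketch of what that single policy can look like in Terraform, assuming the agent uptime metric agent.googleapis.com/agent/uptime and a pre-existing Slack notification channel (the names, duration, and aligner below are illustrative, not the exact policy from this answer):

# Sketch only: one absence condition on the monitoring agent's uptime metric,
# so a dead or unreachable VM produces a single alert instead of one per metric.
resource "google_monitoring_alert_policy" "agent_uptime_absent" {
  display_name = "Monitoring agent uptime absent"  # assumed name
  combiner     = "OR"

  conditions {
    display_name = "Agent uptime metric absent"

    condition_absent {
      duration = "360s"
      filter   = "metric.type=\"agent.googleapis.com/agent/uptime\" resource.type=\"gce_instance\""

      aggregations {
        alignment_period   = "60s"
        per_series_aligner = "ALIGN_RATE"
      }
    }
  }

  # Assumed, pre-existing Slack notification channel resource.
  notification_channels = [google_monitoring_notification_channel.slack.id]
}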

Related

How do I send AWS Backup events to OpsGenie via Eventbridge?

I have a requirement to send AWS Backup events - specifically failed backups and backups where the Windows VSS step failed - to a centralized Opsgenie alerting system. AWS directed us to use EventBridge to parse the JSON object produced by AWS Backup to determine whether the VSS portion failed or not.
SNS is not a viable option because we cannot 'OR' the two rules together in one filter policy, and we only have one endpoint, so two subscriptions to the same topic would overwrite each other. That said, I did successfully send messages to OpsGenie via SNS. So far with EventBridge, I have not had any luck.
I have started to implement most of this in Terraform. I realize TF has some limitations around EventBridge (my two rules cannot be tied to the custom bus I create, so I have to do that step manually; also, I need to create the Opsgenie API integration manually, as Opsgenie does not seem to support the 'EventBridge' type yet - only the older CloudWatch Events integration that ties into SNS seems to be there). Below is my Terraform for reference:
# This module creates an opsgenie team and will tie in existing emails to the team to use with the integration.
module "opsgenie_team" {
source = "app.terraform.io/etc.../opsgenie"
version = "1.1.0"
team_name = "test team"
team_description = "test environment."
team_admin_emails = var.opsgenie_team_admins
team_user_emails = var.opsgenie_team_users
suppress_cloudwatch_events_notifications = var.opsgenie_suppress_cloudwatch_events_notifications
suppress_cloudwatch_notifications = var.opsgenie_suppress_cloudwatch_notifications
suppress_generic_sns_notifications = var.opsgenie_suppress_generic_sns_notifications
}
# Step commented out since 'Webhook' doesn't work.
#
# resource "opsgenie_api_integration" "opsgenie" {
# name = "api-based-int-2"
# type = "Webhook"
#
# responders {
# type = "user"
# id = data.opsgenie_user.test.id
# }
#
# enabled = true
# allow_write_access = true
# suppress_notifications = false
# webhook_url = module.opsgenie_team.cloudwatch_events_integration_sns_endpoint
# }
resource "aws_cloudwatch_event_api_destination" "opsgenie" {
name = "Test"
description = "Connection to OpsGenie"
invocation_endpoint = module.opsgenie_team.cloudwatch_events_integration_sns_endpoint
http_method = "POST"
invocation_rate_limit_per_second = 20
connection_arn = aws_cloudwatch_event_connection.opsgenie.arn
}
resource "aws_cloudwatch_event_connection" "opsgenie" {
name = "opsgenie-event-connection"
description = "Connection to OpsGenie"
authorization_type = "API_KEY"
# Verified key seems to be valid on integration API
# https://api.opsgenie.com/v2/integrations
auth_parameters {
api_key {
key = module.opsgenie_team.cloudwatch_events_integration_id
value = module.opsgenie_team.cloudwatch_events_integration_api_key
}
}
}
# Opsgenie ID created with the manual integration step.
data "aws_cloudwatch_event_source" "opsgenie" {
name_prefix = "aws.partner/opsgenie.com/MY-OPSGENIE-ID"
}
resource "aws_cloudwatch_event_bus" "opsgenie" {
name = data.aws_cloudwatch_event_source.opsgenie.name
event_source_name = data.aws_cloudwatch_event_source.opsgenie.name
}
# Two rules I need to filter on, commented out as they cannot be tied to a custom bus with
# terraform.
# resource "aws_cloudwatch_event_rule" "opsgenie_backup_failures" {
# name = "capture-generic-backup-failures"
# description = "Capture all other backup failures"
#
# event_pattern = <<EOF
# {
# "State": [
# {
# "anything-but": "COMPLETED"
# }
# ]
# }
# EOF
# }
#
# resource "aws_cloudwatch_event_rule" "opsgenie_vss_failures" {
# name = "capture-vss-failures"
# description = "Capture VSS Backup failures"
#
# event_pattern = <<EOF
# {
# "detail-type" : [
# "Windows VSS Backup attempt failed because either Instance or SSM Agent has invalid state or insufficient privileges."
# ]
# }
# EOF
# }
The event bus and API destination seem to be created correctly, and I can find the API key used to communicate with Opsgenie and use it in Postman to hit an Opsgenie endpoint. I manually create the rules and tie them into the custom bus. I even kept them wide open, hoping to capture any AWS Backup events - nothing yet.
I feel like I'm close, but missing a critical detail (or two). Any help is greatly appreciated.
Posing the same question to Atlassian, they sent me this email:
We do have an open feature request for a direct, inbound integration
with EventBridge - I've added your info and a +1 to the request, so
hopefully we'll be able to add that in the future. For reference, the
request ID is OGS-4502.
In the meantime, though, you're correct - you'd need to either use our
CloudWatch Events integration or a direct SNS integration, instead,
which may restrict some of the functionality you would get using
EventBridge directly. With that said - Opsgenie does offer robust
filtering functionality via the advanced integration settings and
alert policies that may be able to achieve the same sort of filtering
you would want to set up on the EventBridge side of things:
https://support.atlassian.com/opsgenie/docs/use-advanced-integration-settings/
https://support.atlassian.com/opsgenie/docs/create-and-manage-global-alert-policies/
So, for now, the answer is to consume all events at the OpsGenie endpoint and filter them with 'opsgenie_integration_action' resources.
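For anyone following the same route, here is a rough sketch of such a filter with the Opsgenie Terraform provider. The integration reference, action name, condition field, and expected value are assumptions for illustration, not a verified configuration:

# Sketch only: filter incoming events on the Opsgenie side instead of in EventBridge.
resource "opsgenie_integration_action" "backup_failures" {
  integration_id = opsgenie_api_integration.opsgenie.id  # assumed integration reference

  create {
    name = "Create alert for failed backups"  # assumed action name

    filter {
      type = "match-all-conditions"
      conditions {
        field          = "message"  # assumed field carrying the backup job state
        operation      = "contains"
        expected_value = "FAILED"   # assumed value; adjust to the Backup event payload
      }
    }
  }
}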

setting up gke autopilot in terraform good example

I'm trying to set up GKE on Autopilot using Terraform. So far, the documentation I have looked at is a bit confusing. I'm looking for a basic setup to get things running. I did a bit of searching on the web and found the following: https://www.youtube.com/watch?v=XTcos7s0iDo, but it contains too much detail about setting up the VPCs and everything. Is there a basic example I can use?
You can just edit your existing gke.tf config and, inside your google_container_cluster resource, add:
maintenance_policy {
  recurring_window {
    start_time = "2021-06-18T00:00:00Z"
    end_time   = "2050-01-01T04:00:00Z"
    recurrence = "FREQ=WEEKLY"
  }
}

enable_autopilot = true

release_channel {
  channel = "REGULAR"
}
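If you would rather have a complete minimal file than a diff, a bare-bones Autopilot cluster can be sketched roughly like this (the cluster name and region are placeholders, and network settings are left at their defaults):

# Minimal sketch of a GKE Autopilot cluster; name and location are placeholders.
resource "google_container_cluster" "autopilot" {
  name     = "my-autopilot-cluster"
  location = "us-central1"  # Autopilot clusters are regional

  enable_autopilot = true

  release_channel {
    channel = "REGULAR"
  }
}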

Terraform aws_lb_ssl_negotiation_policy using AWS Predefined SSL Security Policies

According to: https://www.terraform.io/docs/providers/aws/r/lb_ssl_negotiation_policy.html
You can create a new resource in order to have an ELB SSL policy, so you can customize any protocols and ciphers you want. However, I am looking to use the predefined security policies set by Amazon, such as TLS-1-1-2017-01 or TLS-1-2-2017-01.
http://docs.aws.amazon.com/elasticloadbalancing/latest/classic/elb-security-policy-table.html
Is there a way to use the predefined policies instead of setting a new custom policy?
Looking to solve the same problem, I came across this snippet here: https://github.com/terraform-providers/terraform-provider-aws/issues/822#issuecomment-311448488
Basically, you need to create two resources, the aws_load_balancer_policy, and the aws_load_balancer_listener_policy. In the aws_load_balancer_policy you set the policy_attribute to reference the Predefined Security Policy, and then set your listener policy to reference that aws_load_balancer_policy.
I've added a Pull Request to the terraform AWS docs to make this more explicit here, but here's an example snippet:
resource "aws_load_balancer_policy" "listener_policy-tls-1-1" {
load_balancer_name = "${aws_elb.elb.name}"
policy_name = "elb-tls-1-1"
policy_type_name = "SSLNegotiationPolicyType"
policy_attribute {
name = "Reference-Security-Policy"
value = "ELBSecurityPolicy-TLS-1-1-2017-01"
}
}
resource "aws_load_balancer_listener_policy" "ssl_policy" {
load_balancer_name = "${aws_elb.elb.name}"
load_balancer_port = 443
policy_names = [
"${aws_load_balancer_policy.listener_policy-tls-1-1.policy_name}",
]
}
At first glance it appears that this is creating a custom policy that is based off of the predefined security policy, but when you look at what's created in the AWS console you'll see that it's actually just selected the appropriate Predefined Security Policy.
To piggyback on Kirkland's answer, for posterity, you can do the same thing with aws_lb_ssl_negotiation_policy if you don't need any other policy types:
resource "aws_lb_ssl_negotiation_policy" "my-elb-ssl-policy" {
name = "my-elb-ssl-policy"
load_balancer = "${aws_elb.my-elb.id}"
lb_port = 443
attribute {
name = "Reference-Security-Policy"
value = "ELBSecurityPolicy-TLS-1-2-2017-01"
}
}
Yes, you can define it. The default security policy ELBSecurityPolicy-2016-08 already covers all the SSL protocols you asked for.
Secondly, the Protocol-TLSv1.2 attribute covers both policies you asked about (TLS-1-1-2017-01 and TLS-1-2-2017-01) as well.
(http://docs.aws.amazon.com/elasticloadbalancing/latest/classic/elb-security-policy-table.html)
So make sure you enable it with the code below:
resource "aws_lb_ssl_negotiation_policy" "foo" {
...
attribute {
name = "Protocol-TLSv1.2"
value = "true"
}
}

Disable Azure Automation Runbook Schedule using .net SDK

I am trying to disable a runbook schedule using the .NET SDK.
I retrieved the JobSchedule I want to disable and tried setting the associated runbook and schedule to null and "".
var schedulenm = new ScheduleAssociationProperty();
schedulenm.Name = "";

var runbooknm = new RunbookAssociationProperty();
runbooknm.Name = "";

jobsched.Properties.Schedule = schedulenm;
jobsched.Properties.Runbook = runbooknm;
I also tried directly querying the main schedule and setting its IsEnabled property to false.
However, that also doesn't have any impact.
What is the correct way to disable the schedule associated with a runbook?
(I just want it disabled, not deleted.)
According to your description, if you want to disable the schedule associated with a runbook, you could use the AutomationManagementClient.JobSchedules.Delete method.
A JobSchedule represents the relationship between the runbook and the schedule.
After calling this method, the runbook will no longer be associated with the schedule, but the schedule itself will not be deleted.
For more details, you can refer to the code sample below:
// Get the first job schedule (the runbook <-> schedule link) in the Automation account.
var r2 = automationManagementClient.JobSchedules.List("groupname", "accountname").JobSchedules.First();

// Delete the link; the schedule itself is left in place.
automationManagementClient.JobSchedules.Delete("groupname", "accountname", r2.Properties.Id);
Result: you can see that the schedule itself still exists.
Would that be the exact equivalent of setting the 'Enabled' property to No in the UI?
No. If you want to disable the schedule, you should use the AutomationManagementClient.Schedules.Patch method.
For more details, you can refer to this code:
// Patch the schedule itself and set IsEnabled to false
// (the equivalent of setting 'Enabled' to No in the portal).
AutomationManagementClient automationManagementClient = new AutomationManagementClient(aadTokenCredentials, resourceManagerUri);

SchedulePatchParameters p1 = new SchedulePatchParameters("yourSchedulename");
SchedulePatchProperties p2 = new SchedulePatchProperties();
p2.IsEnabled = false;
p1.Properties = p2;

var result = automationManagementClient.Schedules.Patch("rgname", "am accountname", p1).StatusCode;

NServiceBus Event Subscriptions Not Working With Azure Service Bus

I'm attempting to modify the Azure-based Video Store sample app so that the front-end Ecommerce site can scale out.
Specifically, I want all instances of the web site to be notified of events like OrderPlaced so that no matter which web server the client web app happens to be connected to via SignalR, it will correctly receive the notification and update the UI.
Below is my current configuration in the Global.asax:
Feature.Disable<TimeoutManager>();
Configure.ScaleOut(s => s.UseUniqueBrokerQueuePerMachine());

startableBus = Configure.With()
    .DefaultBuilder()
    .TraceLogger()
    .UseTransport<AzureServiceBus>()
    .PurgeOnStartup(true)
    .UnicastBus()
    .RunHandlersUnderIncomingPrincipal(false)
    .RijndaelEncryptionService()
    .CreateBus();

Configure.Instance.ForInstallationOn<Windows>().Install();
bus = startableBus.Start();
And I've also configured the Azure Service Bus queues using:
class AzureServiceBusConfiguration : IProvideConfiguration<NServiceBus.Config.AzureServiceBusQueueConfig>
{
    public AzureServiceBusQueueConfig GetConfiguration()
    {
        return new AzureServiceBusQueueConfig()
        {
            QueuePerInstance = true
        };
    }
}
I've set the web role to scale to two instances, and as expected, two queues (ecommerce and ecommerce-1) are created. I do not, however, see additional topic subscriptions being created under the videostore.sales.events topic. Instead, I see:
I would think that you would see VideoStore.ECommerce-1.OrderCancelled and VideoStore.ECommerce-1.OrderPlaced subscriptions under the Videostore.Sales.Events topic. Or is that not how subscriptions are stored when using Azure Service Bus?
What am I missing here? I get the event on one of the ecommerce instances, but never on both. Even if this isn't the correct way to scale out SignalR, my use case extends to stuff like cache invalidation.
I also find it strange that two error and audit queues are being created. Why would that happen?
UPDATE
Yves is correct. The AzureServiceBusSubscriptionNamingConvention was not applying the correct individualized name. I was able to fix this by implementing the following EndpointConfig:
namespace VideoStore.ECommerce
{
    public class EndpointConfig : IConfigureThisEndpoint, IWantCustomInitialization
    {
        public void Init()
        {
            AzureServiceBusSubscriptionNamingConvention.Apply = BuildSubscriptionName;
            AzureServiceBusSubscriptionNamingConvention.ApplyFullNameConvention = BuildSubscriptionName;
        }

        private static string BuildSubscriptionName(Type eventType)
        {
            var subscriptionName = eventType != null ? Configure.EndpointName + "." + eventType.Name : Configure.EndpointName;

            if (subscriptionName.Length >= 50)
                subscriptionName = new DeterministicGuidBuilder().Build(subscriptionName).ToString();

            if (!SettingsHolder.GetOrDefault<bool>("ScaleOut.UseSingleBrokerQueue"))
                subscriptionName = Individualize(subscriptionName);

            return subscriptionName;
        }

        public static string Individualize(string queueName)
        {
            var parser = new ConnectionStringParser();
            var individualQueueName = queueName;

            if (SafeRoleEnvironment.IsAvailable)
            {
                var index = parser.ParseIndexFrom(SafeRoleEnvironment.CurrentRoleInstanceId);
                var currentQueue = parser.ParseQueueNameFrom(queueName);

                if (!currentQueue.EndsWith("-" + index.ToString(CultureInfo.InvariantCulture))) // individualize can be applied multiple times
                {
                    individualQueueName = currentQueue
                        + (index > 0 ? "-" : "")
                        + (index > 0 ? index.ToString(CultureInfo.InvariantCulture) : "");
                }

                if (queueName.Contains("#"))
                    individualQueueName += "#" + parser.ParseNamespaceFrom(queueName);
            }

            return individualQueueName;
        }
    }
}
I could not, however, get NServiceBus to recognize my EndpointConfig class. Instead, I had to call it manually before starting the bus. From my Global.asax.cs:
new EndpointConfig().Init();
bus = startableBus.Start();
Once I did this, the subscription names appeared as expected.
Not sure why it's ignoring my IConfigureThisEndpoint, but this works.
This sounds like a bug; can you raise a GitHub issue on it at https://github.com/Particular/NServiceBus.Azure?
That said, I think it's better to use SignalR's scaleout feature instead of QueuePerInstance, as SignalR needs to replicate other information (like connection/group mappings) internally as well when running in scaleout mode.
Update:
I think I see the issue: the subscriptions should be individualized as well, which isn't the case in the current naming conventions:
https://github.com/Particular/NServiceBus.Azure/blob/master/src/NServiceBus.Azure.Transports.WindowsAzureServiceBus/NamingConventions/AzureServiceBusSubscriptionNamingConvention.cs
while it is in the queuenamingconventions
https://github.com/Particular/NServiceBus.Azure/blob/master/src/NServiceBus.Azure.Transports.WindowsAzureServiceBus/NamingConventions/AzureServiceBusQueueNamingConvention.cs#L27
As these conventions are public, you can override them to work around the problem by changing the func in IWantCustomInitialization until I can get a fix in: just copy the current method and add the individualizer logic. The queue individualizer is internal though, so you'll have to copy that class from
https://github.com/Particular/NServiceBus.Azure/blob/master/src/NServiceBus.Azure.Transports.WindowsAzureServiceBus/Config/QueueIndividualizer.cs
