Setup Cloudwatch Alarm with Terraform that uses a query-expression - terraform

My goal is to setup an alarm in Cloudwatch via Terraform, that fires when disk_usage is above a certain treshold. The monitored metrics come from a Non-AWS-Server and are collected via CloudWatch Agent.
My first step was to do this manually, by setting up a metric that Selects the maximum disk_usage of all devices on a selected host:
SELECT MAX(disk_used_percent) FROM CWAgent WHERE host = 'MY_HOST'
I when successfully created an alarm based on this metric. Now I want to do the same thing with Terraform, but I cant figure out how to do that.
If I setup the Terraform-Resource to use a dimension for the host, then I get no results. If I try to setup a metric-query, then I get a conflict between Terraform and AWS, where Terraform tells me that my resource should not declare a "period"-Attribute but AWS demands it and will fail if not provided:
Error: Updating metric alarm failed: ValidationError: Period must not
be null
Currently, my resource looks like this:
resource "aws_cloudwatch_metric_alarm" "disk_usage_alarm" {
alarm_name = "Disk usage alarm on MY_HOST"
alarm_description = "One or more disks on MY_HOST are over 65% capacity"
comparison_operator = "GreaterThanOrEqualToThreshold"
threshold = "65"
evaluation_periods = "2"
datapoints_to_alarm = "1"
treat_missing_data = "missing"
actions_enabled = "false"
insufficient_data_actions = []
alarm_actions = []
ok_actions = []
metric_query {
id = "q1"
label = "Maximum disk_used_percentage for all disks on Host MY_HOST"
return_data = true
expression = "SELECT MAX(disk_used_percent) FROM CWAgent WHERE host = 'MY_HOST'"
}
}
Anyone knows whats wrong here and how to correctly setup this alarm via Terraform?

Related

How to generate multiple names / use results output?

When using azurecaf to generate multiple names like in the following code, how do I use the results output?
resource "azurecaf_name" "names" {
name = var.appname
resource_type = "azurerm_resource_group"
resource_types = ["azurerm_mssql_database"]
prefixes = [var.environment]
suffixes = [var.resource_group_location_short]
random_length = 5
clean_input = false
}
results - The generated name for the Azure resources based in the resource_types list
How to use this? Also, can I somehow debug / print out what results looks like? (I don't know if it is an array, a key-value structure etc)
You can view the results in two common ways. It is applicable to all attributes of the resource.
[1] Exporting the attribute required as a terraform output.
When you add any attribute as an output in your code by default terraform will show you the values with terraform apply.
In your used case.
output "caf_name_result" {
value = azurecaf_name.names.result
}
output "caf_name_results" {
value = azurecaf_name.names.results
}
Apply the config with the above outputs definitions you will have the below output on your terminal.
Changes to Outputs:
+ caf_name_result = (known after apply)
+ caf_name_results = (known after apply)
azurecaf_name.names: Creating...
azurecaf_name.names: Creation complete after 0s [id=YXp1cmVybV9yZXNvdXJjZV9ncm91cAlkZXYtcmctc3RhY2tvdmVyZmxvdy15b2RncC13ZXUKYXp1cmVybV9tc3NxbF9kYXRhYmFzZQlkZXYtc3FsZGItc3RhY2tvdmVyZmxvdy15b2RncC13ZXU=]
Apply complete! Resources: 1 added, 0 changed, 0 destroyed.
Outputs:
caf_name_result = "dev-rg-stackoverflow-yodgp-weu"
caf_name_results = tomap({
"azurerm_mssql_database" = "dev-sqldb-stackoverflow-yodgp-weu"
"azurerm_resource_group" = "dev-rg-stackoverflow-yodgp-weu"
})
After the first successful terraform apply you can view them anytime when you want by using the terraform output command.
$ terraform output
caf_name_result = "dev-rg-stackoverflow-yodgp-weu"
caf_name_results = tomap({
"azurerm_mssql_database" = "dev-sqldb-stackoverflow-yodgp-weu"
"azurerm_resource_group" = "dev-rg-stackoverflow-yodgp-weu"
})
You can be very specific also to check only particular output value.
$ terraform output caf_name_results
tomap({
"azurerm_mssql_database" = "dev-sqldb-stackoverflow-yodgp-weu"
"azurerm_resource_group" = "dev-rg-stackoverflow-yodgp-weu"
})
[2] View your applied resources via Terraform State Commands
This is only available after the resources are applied and only in cases when terraform execution was done from the same machine where this command is running. (in simple the identity doing terraform execution satisfies all the authentication, authorization and network connectivity conditions. )
It is not recommended, just to share another option available when requiring a quick look on the resources applied.
$ terraform state list
azurecaf_name.names
$ terraform state show azurecaf_name.names
# azurecaf_name.names:
resource "azurecaf_name" "names" {
clean_input = false
id = "YXp1cmVybV9yZXNvdXJjZV9ncm91cAlkZXYtcmctc3RhY2tvdmVyZmxvdy15b2RncC13ZXUKYXp1cmVybV9tc3NxbF9kYXRhYmFzZQlkZXYtc3FsZGItc3RhY2tvdmVyZmxvdy15b2RncC13ZXU="
name = "stackoverflow"
passthrough = false
prefixes = [
"dev",
]
random_length = 5
random_seed = 1676730686950185
random_string = "yodgp"
resource_type = "azurerm_resource_group"
resource_types = [
"azurerm_mssql_database",
]
result = "dev-rg-stackoverflow-yodgp-weu"
results = {
"azurerm_mssql_database" = "dev-sqldb-stackoverflow-yodgp-weu"
"azurerm_resource_group" = "dev-rg-stackoverflow-yodgp-weu"
}
separator = "-"
suffixes = [
"weu",
]
use_slug = true
}
[1]
[2]

The "count" value depends on resource attributes that cannot be determined until apply, so Terraform cannot predict how many instances will be created

I want to exempt certain policies for an Azure VM. I have the following terraform code to exempt the policies.
It uses locals to identify the scope on which policies should be exempt.
locals {
exemption_scope = try({
mg = length(regexall("(\\/managementGroups\\/)", var.scope)) > 0 ? 1 : 0,
sub = length(split("/", var.scope)) == 3 ? 1 : 0,
rg = length(regexall("(\\/managementGroups\\/)", var.scope)) < 1 ? length(split("/", var.scope)) == 5 ? 1 : 0 : 0,
resource = length(split("/", var.scope)) >= 6 ? 1 : 0,
})
expires_on = var.expires_on != null ? "${var.expires_on}T23:00:00Z" : null
metadata = var.metadata != null ? jsonencode(var.metadata) : null
# generate reference Ids when unknown, assumes the set was created with the initiative module
policy_definition_reference_ids = length(var.member_definition_names) > 0 ? [for name in var.member_definition_names :
replace(substr(title(replace(name, "/-|_|\\s/", " ")), 0, 64), "/\\s/", "")
] : var.policy_definition_reference_ids
exemption_id = try(
azurerm_management_group_policy_exemption.management_group_exemption[0].id,
azurerm_subscription_policy_exemption.subscription_exemption[0].id,
azurerm_resource_group_policy_exemption.resource_group_exemption[0].id,
azurerm_resource_policy_exemption.resource_exemption[0].id,
"")
}
and the above local is used like mentioned below
resource "azurerm_management_group_policy_exemption" "management_group_exemption" {
count = local.exemption_scope.mg
name = var.name
display_name = var.display_name
description = var.description
management_group_id = var.scope
policy_assignment_id = var.policy_assignment_id
exemption_category = var.exemption_category
expires_on = local.expires_on
policy_definition_reference_ids = local.policy_definition_reference_ids
metadata = local.metadata
}
Both the locals and azurerm_management_group_policy_exemption are part of the same module file. And Policy exemption is applied like mentioned below
module exemption_jumpbox_sql_vulnerability_assessment {
count = var.enable_jumpbox == true ? 1 : 0
source = "../policy_exemption"
name = "Exemption - SQL servers on machines should have vulnerability"
display_name = "Exemption - SQL servers on machines should have vulnerability"
description = "Not required for Jumpbox"
scope = module.create_jumbox_vm[0].virtual_machine_id
policy_assignment_id = module.security_center.azurerm_subscription_policy_assignment_id
policy_definition_reference_ids = var.exemption_policy_definition_ids
exemption_category = "Waiver"
depends_on = [module.create_jumbox_vm,module.security_center]
}
It works for an existing Azure VM. However it throws the following error while trying to provision the Azure VM and apply the policy exemption on this Azure VM.
Ideally, module.exemption_jumpbox_sql_vulnerability_assessment should get executed only after [module.create_jumbox_vm as it is defined as a dependent. But not sure why it is throwing the error
│ The "count" value depends on resource attributes that cannot be determined
│ until apply, so Terraform cannot predict how many instances will be
│ created. To work around this, use the -target argument to first apply only
│ the resources that the count depends on.
I tried to reproduce the scenario in my environment.
resource "azurerm_management_group_policy_exemption" "management_group_exemption" {
count = local.exemption_scope.mg
name = var.name
display_name = var.display_name
description = var.description
management_group_id = var.scope
policy_assignment_id = var.policy_assignment_id
exemption_category = var.exemption_category
expires_on = local.expires_on
policy_definition_reference_ids = local.policy_definition_reference_ids
metadata = local.metadata
}
locals {
exemption_scope = try({
...
})
Received the same error:
The "count" value depends on resource attributes that cannot be determined
│ until apply, so Terraform cannot predict how many instances will be
│ created. To work around this, use the -target argument to first apply only
│ the resources that the count depends on.
Referring to local values , the values will be known on the apply time only, and not during the apply time .So if it is not dependent on other sources , it will expmpt policies but it is dependent on the VM which may be still in process of creation.
So target only the resource that is dependent on first ,as only when vm is created is when the exemption policy can be assigned to that vm.
Check count:using-expressions-in-count | Terraform | HashiCorp Developer
Also note that while using terraform count argument with Azure Virtual Machines ,NIC resource also to be created for each Virtual Machine resource.
resource "azurerm_network_interface" "nic" {
count = var.vm_count
name = "${var.vm_name_pfx}-${count.index}-nic"
location = data.azurerm_resource_group.example.location
resource_group_name = data.azurerm_resource_group.example.name
//tags = var.tags
ip_configuration {
name = "internal"
subnet_id = azurerm_subnet.internal.id
private_ip_address_allocation = "Dynamic"
}
}
Reference: terraform-azurerm-policy-exemptions/examples/count at main · AnsumanBal-MT/terraform-azurerm-policy-exemptions · GitHub

Managing Auto Scaling Group via terraform

Let say I have a auto-scaling group which I manage via terraform. And i want that auto scaling group to scale up and scale down based on our business hours .
The TF template for managing ASG :
resource "aws_autoscaling_group" "foobar" {
availability_zones = ["us-west-2a"]
name = "terraform-test-foobar5"
max_size = 1
min_size = 1
health_check_grace_period = 300
health_check_type = "ELB"
force_delete = true
termination_policies = ["OldestInstance"]
}
resource "aws_autoscaling_schedule" "foobar" {
scheduled_action_name = "foobar"
min_size = 0
max_size = 1
desired_capacity = 0
start_time = "2016-12-11T18:00:00Z"
end_time = "2016-12-12T06:00:00Z"
autoscaling_group_name = aws_autoscaling_group.foobar.name
}
As we can see here i have to set a particular date and time for the action.
what I want is : I want to scale down on saturday night 9 pm by 10% of my current capacity, and then again want to scale up by 10% on monday morning 6 am .
How can I achieve this.
Any help is highly appreciated. Please let me know how to get through this.
The solution is not straightforward, but is doable. The required steps are:
create a Lambda function that scales down the ASG (e.g. with Boto3 and Python)
assign an IAM role with the right permissions
create a Cron trigger for "every saturday 9pm" with aws_cloudwatch_event_rule
create a aws_cloudwatch_event_target, with the previously created Cron trigger and Lambda function
repeat for scaling up
This module will probably fit your needs, you just have to code the Lambda and use the module to trigger it on a schedule.

Terraform Datadog Query Is Invalid

I'm trying to test out creating a monitor for google pub sub and am getting an "Invalid Query" error. This is the query text when i view source of another working monitor, so i'm confused as to why this isn't working.
Error: Error: error creating monitor: 400 Bad Request: {"errors":["The value provided for parameter 'query' is invalid"]}
Terraform:
resource "datadog_monitor" "bad_stuff_sub_monitor" {
name = "${var.customer_name} Bad Stuff Monitor"
type = "metric alert"
message = "${var.customer_name} Bad Stuff Topic getting too big. Notify: ${var.datadog_monitor_notify_list}"
escalation_message = "Escalation message #pagerduty"
query = "avg:gcp.pubsub.subscription.num_undelivered_messages{project_id:terraform_gcp_test}"
thresholds = {
ok = 0
warning = 1
warning_recovery = 0
critical = 2
critical_recovery = 1
}
notify_no_data = false
renotify_interval = 1440
notify_audit = false
timeout_h = 60
include_tags = true
# ignore any changes in silenced value; using silenced is deprecated in favor of downtimes
lifecycle {
ignore_changes = [silenced]
}
tags = [var.customer_name, var.project_name]
}
So I ended up just looking at the tests in the datadog terraform provider and noticing the query format they are testing.
query = "avg(last_30m):avg:gcp.pubsub.subscription.num_undelivered_messages{project_id:${var.project_name},subscription_id:{project_id:terraform_gcp_test} > 2"
It seems you need to specify a time range and also add in a comparison threshold that matches your critical alert threshold. That was what was missing.

Terraform .11 to .12 conversion of deeply nested data

So, in my old .11 code, I have a file where i my output modules locals section, I'm building:
this_assigned_nat_ip = google_compute_instance.this_public.*.network_interface.0.access_config.0.assigned_nat_ip--
Which later gets fed to the output statement.  This module could create N instances. So what it used to do was give me the first nat ip on the first access_config block on the first network interface of all the instances we created.  (Someone locally wrote the code so we know that there's only going to be one network interface with one access config block).
How do I translate that to t12?  I'm unsure of the syntax to keep the nesting.
Update:
Here's a chunk of the raw data out of a terraform show from tf11 (slightly sanitized)
module.gcp_bob_servers_ams.google_compute_instance.this_public.0:
machine_type = n1-standard-2
min_cpu_platform =
network_interface.# = 1
network_interface.0.access_config.# = 1
network_interface.0.access_config.0.assigned_nat_ip =
network_interface.0.access_config.0.nat_ip = 1.2.3.4
network_interface.0.access_config.0.network_tier = PREMIUM
Terraform show of equivalent host in tf12:
# module.bob.module.bob_gcp_ams.module.atom_d.google_compute_instance.this[1]:
resource "google_compute_instance" "this" {
allow_stopping_for_update = true
network_interface {
name = "nic0"
network = "https://www.googleapis.com/compute/v1/projects/stuff-scratch/global/networks/scratch-public"
network_ip = "10.112.112.6"
subnetwork = "https://www.googleapis.com/compute/v1/projects/stuff-scratch/regions/europe-west4/subnetworks/scratch-europe-west4-x-public-subnet"
subnetwork_project = "stuff-scratch"
access_config {
nat_ip = "35.204.132.177"
network_tier = "PREMIUM"
}
}
scheduling {
automatic_restart = true
on_host_maintenance = "MIGRATE"
preemptible = false
}
}
If I understand correctly this_assigned_nat_ip is a list of IPs. You should be able to get the same thing in Terraform 0.12 by doing:
this_assigned_nat_ip = [for i in google_compute_instance.this_public : i.network_interface.0.access_config.0.assigned_nat_ip]
I did not test is, so I might have some small syntax error, but the for is the key to get that done.
Turns out this[*].network_interface[*].access_config[*].nat_ip[*] gave me what I needed. Given there's only every going to be one address on the interface, it comes out fine.

Resources