How do I send AWS Backup events to OpsGenie via EventBridge? - terraform

I have a requirement to send AWS Backup events - specifically failed backups, and backups where the Windows VSS step failed - to a centralized OpsGenie alerting system. AWS directed us to use EventBridge to parse the JSON object produced by AWS Backup and determine whether the VSS portion failed.
SNS is not a viable option because we cannot 'OR' the two rules together in one filter policy, and we only have one endpoint, so a second subscription to the same topic would overwrite the first. That said, I did successfully send messages to OpsGenie via SNS. So far with EventBridge, I have not had any luck.
I have started to implement most of this in Terraform. I realize Terraform has some limitations with EventBridge: my two rules cannot be tied to the custom bus I create, so I have to do that step manually. I also need to create the OpsGenie API integration manually, as OpsGenie does not seem to support the 'EventBridge' integration type yet - only the older CloudWatch Events integration that ties into SNS is available. Below is my Terraform for reference:
# This module creates an OpsGenie team and will tie in existing emails to the team to use with the integration.
module "opsgenie_team" {
  source  = "app.terraform.io/etc.../opsgenie"
  version = "1.1.0"

  team_name         = "test team"
  team_description  = "test environment."
  team_admin_emails = var.opsgenie_team_admins
  team_user_emails  = var.opsgenie_team_users

  suppress_cloudwatch_events_notifications = var.opsgenie_suppress_cloudwatch_events_notifications
  suppress_cloudwatch_notifications        = var.opsgenie_suppress_cloudwatch_notifications
  suppress_generic_sns_notifications       = var.opsgenie_suppress_generic_sns_notifications
}
# Step commented out since 'Webhook' doesn't work.
#
# resource "opsgenie_api_integration" "opsgenie" {
#   name = "api-based-int-2"
#   type = "Webhook"
#
#   responders {
#     type = "user"
#     id   = data.opsgenie_user.test.id
#   }
#
#   enabled                = true
#   allow_write_access     = true
#   suppress_notifications = false
#   webhook_url            = module.opsgenie_team.cloudwatch_events_integration_sns_endpoint
# }
resource "aws_cloudwatch_event_api_destination" "opsgenie" {
name = "Test"
description = "Connection to OpsGenie"
invocation_endpoint = module.opsgenie_team.cloudwatch_events_integration_sns_endpoint
http_method = "POST"
invocation_rate_limit_per_second = 20
connection_arn = aws_cloudwatch_event_connection.opsgenie.arn
}
resource "aws_cloudwatch_event_connection" "opsgenie" {
name = "opsgenie-event-connection"
description = "Connection to OpsGenie"
authorization_type = "API_KEY"
# Verified key seems to be valid on integration API
# https://api.opsgenie.com/v2/integrations
auth_parameters {
api_key {
key = module.opsgenie_team.cloudwatch_events_integration_id
value = module.opsgenie_team.cloudwatch_events_integration_api_key
}
}
}
# OpsGenie ID created with the manual integration step.
data "aws_cloudwatch_event_source" "opsgenie" {
  name_prefix = "aws.partner/opsgenie.com/MY-OPSGENIE-ID"
}

resource "aws_cloudwatch_event_bus" "opsgenie" {
  name              = data.aws_cloudwatch_event_source.opsgenie.name
  event_source_name = data.aws_cloudwatch_event_source.opsgenie.name
}
# Two rules I need to filter on, commented out as they cannot be tied to a custom bus with
# Terraform.
# resource "aws_cloudwatch_event_rule" "opsgenie_backup_failures" {
#   name        = "capture-generic-backup-failures"
#   description = "Capture all other backup failures"
#
#   event_pattern = <<EOF
# {
#   "State": [
#     {
#       "anything-but": "COMPLETED"
#     }
#   ]
# }
# EOF
# }
#
# resource "aws_cloudwatch_event_rule" "opsgenie_vss_failures" {
#   name        = "capture-vss-failures"
#   description = "Capture VSS Backup failures"
#
#   event_pattern = <<EOF
# {
#   "detail-type": [
#     "Windows VSS Backup attempt failed because either Instance or SSM Agent has invalid state or insufficient privileges."
#   ]
# }
# EOF
# }
The event bus and API destination seem to be created correctly, and I can take the API key used to communicate with OpsGenie and use it in Postman to hit an OpsGenie endpoint. I manually create the rules and tie them into the custom bus. I even left the rules wide open, hoping to capture any AWS Backup events - nothing yet.
I feel like I'm close, but missing a critical detail (or two). Any help is greatly appreciated.

I posed the same question to Atlassian, and they sent me this email:
We do have an open feature request for a direct, inbound integration
with EventBridge - I've added your info and a +1 to the request, so
hopefully we'll be able to add that in the future. For reference, the
request ID is OGS-4502.
In the meantime, though, you're correct - you'd need to either use our
CloudWatch Events integration or a direct SNS integration, instead,
which may restrict some of the functionality you would get using
EventBridge directly. With that said - Opsgenie does offer robust
filtering functionality via the advanced integration settings and
alert policies that may be able to achieve the same sort of filtering
you would want to set up on the EventBridge side of things:
https://support.atlassian.com/opsgenie/docs/use-advanced-integration-settings/
https://support.atlassian.com/opsgenie/docs/create-and-manage-global-alert-policies/
So, for now, the answer is to consume all events at the OpsGenie endpoint and filter them with 'opsgenie_integration_action' resources.
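For reference, here is a rough sketch of that filtering approach with an opsgenie_integration_action resource. This is only a sketch: it assumes the integration ID is available from the module output already used above (cloudwatch_events_integration_id), and it assumes the forwarded backup events surface their state and the VSS failure text in the alert description - verify the actual field contents in OpsGenie and adjust the conditions accordingly.
# Sketch only: raise alerts for failed backups / VSS failures, ignore everything else.
resource "opsgenie_integration_action" "backup_filter" {
  integration_id = module.opsgenie_team.cloudwatch_events_integration_id

  create {
    name = "create-backup-failure-alerts"

    filter {
      type = "match-any-condition"
      conditions {
        field          = "description"
        operation      = "contains"
        expected_value = "FAILED" # assumption: the backup state appears in the description
      }
      conditions {
        field          = "description"
        operation      = "contains"
        expected_value = "Windows VSS Backup attempt failed"
      }
    }
  }

  ignore {
    name  = "ignore-everything-else"
    order = 2

    filter {
      type = "match-all"
    }
  }
}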

Related

Terraform check if resource exists before creating it

Is there a way in Terraform to check if a resource in Google Cloud exists prior to trying to create it?
I want to check, during a job in my CircleCI CI/CD pipeline, whether the resources below exist. I have access to terminal commands, bash, and gcloud commands. If the resources exist, I want to use them; if they do not, I want to create them. I am doing this logic in CircleCI's config.yml as steps where I have access to terminal commands and bash. My goal is to create my necessary infrastructure (resources) in GCP when it is needed, and otherwise reuse it, without getting Terraform errors in my CI/CD builds.
If I try to create a resource that already exists, terraform apply results in an error along the lines of "you already own this resource," and my CI/CD job fails.
Below is pseudocode describing the resources I am trying to get.
resource "google_artifact_registry_repository" "main" {
# this is the repo for hosting my Docker images
# it does not have a data source afaik because it is beta
}
For my google_artifact_registry_repository resource: one approach is to run terraform apply with a data source block and see whether a value is returned. The problem is that google_artifact_registry_repository does not have a data source block, so I would have to create the resource once using a resource block and have every CI/CD build thereafter rely on it being there. Is there a workaround to read whether it exists?
resource "google_storage_bucket" "bucket" {
# bucket containing the folder below
}
resource "google_storage_bucket_object" "content_folder" {
# folder containing Terraform default.tfstate for my Cloud Run Service
}
For my google_storage_bucket and google_storage_bucket_object resources: if I run terraform apply with a data source block to see whether these exist, the issue is that when the resources are not found, Terraform takes a very long time to return that status. It would be great if I could determine whether a resource exists within 10-15 seconds or so, and otherwise assume it does not exist.
data "google_storage_bucket" "bucket" {
# bucket containing the folder below
}
output bucket {
value = data.google_storage_bucket.bucket
}
When the resource exists, I can use terraform output bucket to get that value. If it does not exist, Terraform takes too long to return a response. Any ideas on this?
Thanks to the advice of Marcin, I have a working example of how to solve my problem of checking if a resource exists in GCP using Terraform's external data sources. This is one way that works. I am sure there are other approaches.
I have a CircleCI config.yml with a job that uses run commands and bash. From bash, I init/apply a Terraform configuration that checks whether my resource exists, like so:
data "external" "get_bucket" {
program = ["bash","gcp.sh"]
query = {
bucket_name = var.bucket_name
}
}
output "bucket" {
value = data.external.get_bucket.result.name
}
Then in my gcp.sh, I use gsutil to get my bucket if it exists.
#!/bin/bash
# Read the bucket name from the JSON query passed in by the external data source.
eval "$(jq -r '@sh "BUCKET_NAME=\(.bucket_name)"')"

# List the bucket; the output is empty if it does not exist.
bucket=$(gsutil ls "gs://$BUCKET_NAME")

if [[ ${#bucket} -gt 0 ]]; then
  jq -n --arg name "$BUCKET_NAME" '{name: $name}'
else
  jq -n --arg name "" '{name: $name}'
fi
Then in my CircleCI config.yml, I put it all together.
terraform init
terraform apply -auto-approve -var bucket_name=my-bucket
bucket=$(terraform output bucket)
At this point I check if the bucket name is returned and determine how to proceed based on that.
Terraform does not have any built-in tools for checking whether resources already exist, as this is not what Terraform is meant to do. However, you can create your own custom data source.
Using a custom data source you can program any logic you want, including checking for pre-existing resources, and return that information to Terraform for later use.
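For example, for the google_artifact_registry_repository case from the question (which has no data source), a custom data source could shell out to gcloud. A minimal sketch, assuming a hypothetical helper script check_artifact_repo.sh that runs gcloud artifacts repositories describe and prints {"exists":"true"} or {"exists":"false"}, plus hypothetical repo_name and location variables:
# Sketch: wrap a gcloud-based existence check in an external data source.
data "external" "artifact_repo" {
  program = ["bash", "check_artifact_repo.sh"] # hypothetical helper script

  query = {
    repo_name = var.repo_name
    location  = var.location
  }
}

# Only create the repository when the check reports it does not exist yet.
resource "google_artifact_registry_repository" "main" {
  count         = data.external.artifact_repo.result.exists == "true" ? 0 : 1
  repository_id = var.repo_name
  location      = var.location
  format        = "DOCKER"
}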
There is a way to check whether a resource already exists before creating it, but with this approach you need to know in advance whether it exists: reading data for a resource that does not exist will give you an error.
I will demonstrate it by creating or reading data from an Azure resource group. First, create a boolean variable azurerm_create_resource_group. Set it to true if you need to create the resource group; set it to false if you just want to read data from an existing one.
variable "azurerm_create_resource_group" {
type = bool
}
Next, read data about the resource by supplying a ternary expression to count, and do the same for creating the resource:
data "azurerm_resource_group" "rg" {
count = var.azurerm_create_resource_group == false ? 1 : 0
name = var.azurerm_resource_group
}
resource "azurerm_resource_group" "rg" {
count = var.azurerm_create_resource_group ? 1 : 0
name = var.azurerm_resource_group
location = var.azurerm_location
}
The code will either create the resource group or read data from it, based on the value of var.azurerm_create_resource_group. Next, combine the outputs from both the data and resource blocks into locals:
locals {
  resource_group_name = element(coalescelist(data.azurerm_resource_group.rg.*.name, azurerm_resource_group.rg.*.name, [""]), 0)
  location            = element(coalescelist(data.azurerm_resource_group.rg.*.location, azurerm_resource_group.rg.*.location, [""]), 0)
}
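Downstream resources can then reference the locals regardless of which branch was taken. For example (a sketch; the storage account is just a placeholder resource and name):
resource "azurerm_storage_account" "example" {
  name                     = "examplestorageacct"
  resource_group_name      = local.resource_group_name
  location                 = local.location
  account_tier             = "Standard"
  account_replication_type = "LRS"
}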
Another way of doing it might be using terraformer to import the infra code.
I hope this helps.
This works for me:
Create the data source:
data "gitlab_user" "user" {
for_each = local.users
username = each.value.user_name
}
Create the resource:
resource "gitlab_user" "user" {
  for_each       = local.users
  name           = each.key
  username       = data.gitlab_user.user[each.key].username != null ? data.gitlab_user.user[each.key].username : split("@", each.value.user_email)[0]
  email          = each.value.user_email
  reset_password = data.gitlab_user.user[each.key].username != null ? false : true
}
P.S.
Variable
variable "users_info" {
type = list(
object(
{
name = string
user_name = string
user_email = string
access_level = string
expires_at = string
group_name = string
}
)
)
description = "List of users and their access to team's groups for newcommers"
}
Locals
locals {
  users = { for user in var.users_info : user.name => user }
}

Unable to get machine type information for machine type n1-standard-2 in zone us-central-c because of insufficient permissions - Google Cloud Dataflow

I am not sure what I am missing, but I am unable to start the job; it fails with insufficient permissions.
Here is the Terraform code I run:
resource "google_dataflow_job" "poc-pubsub-stream" {
project = local.project_id
region = local.region
zone = local.zone
name = "poc-pubsub-to-cloud-storage"
template_gcs_path = "gs://dataflow-templates-us-central1/latest/Cloud_PubSub_to_GCS_Text"
temp_gcs_location = "gs://${module.poc-bucket.bucket.name}/tmp"
enable_streaming_engine = true
on_delete = "cancel"
service_account_email = google_service_account.poc-stream-sa.email
parameters = {
inputTopic = google_pubsub_topic.poc-topic.id
outputDirectory = "gs://${module.poc-bucket.bucket.name}/"
outputFilenamePrefix = "poc-"
outputFilenameSuffix = ".txt"
}
labels = {
pipeline = "poc-stream"
}
depends_on = [
module.poc-bucket,
google_pubsub_topic.poc-topic,
]
}
These are the permissions on the service account used in the Terraform code:
Any thoughts on what I am missing?
The error says Dataflow is unable to get the machine type information because of insufficient permissions. To access machine type information, add the roles/compute.viewer role to your service account; it grants access to machine type information and other read-only settings.
Refer to the Dataflow security and permissions documentation for more information about the permissions required to create a Dataflow job.
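If you manage IAM in the same Terraform configuration, the grant could look roughly like this (a sketch reusing the project local and service account from the question; adjust to your own setup):
# Grant the Dataflow controller service account read access to Compute Engine
# resources such as machine types.
resource "google_project_iam_member" "dataflow_compute_viewer" {
  project = local.project_id
  role    = "roles/compute.viewer"
  member  = "serviceAccount:${google_service_account.poc-stream-sa.email}"
}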
It turned out my Dataflow job needed the following options. Per the docs they are optional, but in my case they had to be defined:
...
network = data.terraform_remote_state.dev.outputs.network.network_name
subnetwork = data.terraform_remote_state.dev.outputs.network.subnets["us-east4/us-east4-dev"].self_link
...

Cloudflare page rules using terraform-cloudflare provider does not update page rules

I am using Terraform + Cloudflare provider.
I created a page rule the first time I ran terraform plan + terraform apply.
Running the same command a second time returns the error:
Error: Failed to create page rule: error from makeRequest: HTTP status 400: content "{"success":false,"errors":[{"code":1004,"message":"Page Rule validation failed: See messages for details."}],"messages":[{"code":1,"message":".distinctTargetUrl: Your zone already has an existing page rule with that URL. If you are modifying only page rule settings use the Edit Page Rule option instead","type":null}],"result":null}"
TL;DR: How can I make Terraform update an existing page rule just by changing the definition in this file? Isn't that how this is supposed to work?
This is the terraform.tf file:
provider "cloudflare" {
email = "__EMAIL__"
api_key = "__GLOBAL_API_KEY__"
}
resource "cloudflare_zone_settings_override" "default_cloudflare_config" {
zone_id = "__ZONE_ID__"
settings {
always_online = "on"
always_use_https = "off"
min_tls_version = "1.0"
opportunistic_encryption = "on"
tls_1_3 = "zrt"
automatic_https_rewrites = "on"
ssl = "strict"
# 8 days
browser_cache_ttl = "691200"
}
}
resource "cloudflare_page_rule" "rule_bypass_wp_admin" {
target = "*.__DOMAIN__/*wp-admin*"
zone_id = "__ZONE_ID__"
priority = 2
status = "active"
actions {
always_use_https = true
always_online = "off"
cache_level = "bypass"
disable_apps = "true"
disable_performance = true
disable_security = true
}
}
Add the following block to your page rule definition:
lifecycle {
  ignore_changes = [priority]
}
This instructs Terraform to ignore any changes to this field. That way, when you run terraform apply, Terraform picks up the changes as an update to the existing resource rather than creating a new one.
In this case, Terraform was trying to create a new page rule, which conflicts with Cloudflare's limitation that you cannot have multiple page rules acting on the same resource path.
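In context, the block sits inside the existing resource:
resource "cloudflare_page_rule" "rule_bypass_wp_admin" {
  target   = "*.__DOMAIN__/*wp-admin*"
  zone_id  = "__ZONE_ID__"
  priority = 2
  status   = "active"

  # ... actions block as defined above ...

  lifecycle {
    ignore_changes = [priority]
  }
}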
TIP: Run terraform plan -out=tfplan; this prints the plan that will be applied both on screen and to a file. You then get some insight into the changes Terraform will make and a chance to spot any unintended changes.
I still can't update the rule via Terraform, so I used Python to delete it before recreating it.
# Delete existing page rules using the API before re-adding them with Terraform.
# For some reason I could not update them with Terraform without deleting them first.
# https://stackoverflow.com/questions/63942345/cloudflare-page-rules-using-terraform-cloudflare-provider-does-not-update-page-r
page_rules = cf.zones.pagerules.get(zone_id)
print(page_rules)

for pr in page_rules:
    cf.zones.pagerules.delete(zone_id, pr.get('id'))

page_rules = cf.zones.pagerules.get(zone_id)
if page_rules:
    exit('Failed to delete existing page rules for site')
Try removing the always_use_https argument so your actions block looks like this:
actions {
  always_online       = "off"
  cache_level         = "bypass"
  disable_apps        = "true"
  disable_performance = true
  disable_security    = true
}
Today I discovered that there is some issue with this argument; it looks like a bug.

GCE alerting when one of created metrics is absent (via terraform)

I have configured alert policies via Terraform, including CPU/memory and many other alerts. Unfortunately, I have run into an issue: when one of my GCE instances became unresponsive, I received a flood of alerts in Slack, because I have configured a condition_absent block for all of my policies.
For example:
condition_absent {
  duration = "360s"
  filter   = "metric.type=\"custom.googleapis.com/quota/gce\" resource.type=\"global\""

  aggregations {
    alignment_period     = "60s"
    cross_series_reducer = "REDUCE_SUM"
    group_by_fields = [
      "metric.label.metric",
      "metric.label.region",
    ]
    per_series_aligner = "ALIGN_MEAN"
  }
}

condition_absent {
  duration = "360s"
  filter   = "metric.type=\"agent.googleapis.com/memory/percent_used\" resource.type=\"gce_instance\" metric.label.\"state\"=\"used\""

  aggregations {
    alignment_period     = "60s"
    cross_series_reducer = "REDUCE_SUM"
    per_series_aligner   = "ALIGN_MEAN"
  }
}
My question is the following: can I create one condition_absent block in Terraform instead of many, and send a single notification instead of dozens when one of the metrics stops reporting?
I resolved this by adding a Monitoring Agent Uptime metric alert. It correctly fires when the VM is inaccessible (under overload, etc.).
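A rough Terraform sketch of such a policy (the metric name, duration, and notification channel variable are assumptions - check them against the agent metrics and channels in your project):
# Single alert that fires when the monitoring agent stops reporting at all,
# instead of one absence alert per metric.
resource "google_monitoring_alert_policy" "agent_uptime_absent" {
  display_name = "Monitoring agent uptime absent"
  combiner     = "OR"

  conditions {
    display_name = "Agent uptime metric absent"

    condition_absent {
      filter   = "metric.type=\"agent.googleapis.com/agent/uptime\" resource.type=\"gce_instance\""
      duration = "360s"

      aggregations {
        alignment_period   = "60s"
        per_series_aligner = "ALIGN_RATE"
      }
    }
  }

  notification_channels = [var.slack_notification_channel] # hypothetical variable
}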

Terraform aws_lb_ssl_negotiation_policy using AWS Predefined SSL Security Policies

According to: https://www.terraform.io/docs/providers/aws/r/lb_ssl_negotiation_policy.html
You can create a new resource in order to have an ELB SSL policy, which lets you customize whatever protocols and ciphers you want. However, I am looking to use the predefined security policies set by Amazon, such as TLS-1-1-2017-01 or TLS-1-2-2017-01.
http://docs.aws.amazon.com/elasticloadbalancing/latest/classic/elb-security-policy-table.html
Is there a way to use predefined policies instead of setting a new custom policy?
Looking to solve the same problem, I came across this snippet here: https://github.com/terraform-providers/terraform-provider-aws/issues/822#issuecomment-311448488
Basically, you need to create two resources, the aws_load_balancer_policy, and the aws_load_balancer_listener_policy. In the aws_load_balancer_policy you set the policy_attribute to reference the Predefined Security Policy, and then set your listener policy to reference that aws_load_balancer_policy.
I've added a Pull Request to the terraform AWS docs to make this more explicit here, but here's an example snippet:
resource "aws_load_balancer_policy" "listener_policy-tls-1-1" {
load_balancer_name = "${aws_elb.elb.name}"
policy_name = "elb-tls-1-1"
policy_type_name = "SSLNegotiationPolicyType"
policy_attribute {
name = "Reference-Security-Policy"
value = "ELBSecurityPolicy-TLS-1-1-2017-01"
}
}
resource "aws_load_balancer_listener_policy" "ssl_policy" {
load_balancer_name = "${aws_elb.elb.name}"
load_balancer_port = 443
policy_names = [
"${aws_load_balancer_policy.listener_policy-tls-1-1.policy_name}",
]
}
At first glance it appears that this creates a custom policy based on the predefined security policy, but when you look at what is created in the AWS console you'll see that it has actually just selected the appropriate predefined security policy.
To piggyback on Kirkland's answer, for posterity, you can do the same thing with aws_lb_ssl_negotiation_policy if you don't need any other policy types:
resource "aws_lb_ssl_negotiation_policy" "my-elb-ssl-policy" {
name = "my-elb-ssl-policy"
load_balancer = "${aws_elb.my-elb.id}"
lb_port = 443
attribute {
name = "Reference-Security-Policy"
value = "ELBSecurityPolicy-TLS-1-2-2017-01"
}
}
Yes, you can define it. The default security policy ELBSecurityPolicy-2016-08 already covers all the SSL protocols you asked for.
Secondly, Protocol-TLSv1.2 covers both of the policies (TLS-1-1-2017-01 and TLS-1-2-2017-01) you asked about as well.
(http://docs.aws.amazon.com/elasticloadbalancing/latest/classic/elb-security-policy-table.html)
So make sure you enable it with the code below:
resource "aws_lb_ssl_negotiation_policy" "foo" {
...
attribute {
name = "Protocol-TLSv1.2"
value = "true"
}
}
