Cross-account subdomain/hosted zone delegation in Route 53 with Terraform

I have two environments and two AWS accounts: dev and prod. Hence, I have two hosted zones:
dev.example.com in the dev account
example.com in my prod account
In order to successfully route traffic to my dev.example.com subdomain, I need to delegate from my top-level domain (TLD) by adding a name server (NS) record for the subdomain in the TLD's hosted zone. E.g.,
dev.example.com NS Simple [ns-1960.awsdns-22.co.uk. ns-188.awsdns-20.com. ns-208.awsdns-37.net. ns-1089.awsdns-01.org.]
In Terraform code, I would define the two hosted zones as such:
resource "aws_route53_zone" "top_level_domain" {
count = var.env == "prod" ? 1 : 0
name = "example.com"
tags = {
name = "Hosted Zone for top-level domain in production"
env = var.env
}
}
resource "aws_route53_zone" "subdomain" {
count = var.env == "prod" ? 0 : 1
name = "dev.example.com"
tags = {
name = "Hosted Zone for ${var.env} environment"
env = var.env
}
}
In the interests of keeping everything codified, I would like to be able to define my delegation record in Terraform configuration. E.g.,
resource "aws_route53_record" "subdomain_delegation" {
count = var.env == "prod" ? 1 : 0
zone_id = aws_route53_zone.top_level_domain.zone_id
name = "dev.example.com"
type = "NS"
ttl = 300
records = [
aws_route53_zone.subdomain.name_servers
]
}
The issue is that the aws_route53_zone.subdomain resource does not exist in my Terraform state for the prod environment, and so aws_route53_zone.subdomain[0].name_servers cannot be resolved.
Is there an elegant way to solve this? Or is this just a fact of life if one chooses to use AWS accounts for physical environment separation?
Update
The folder structure for my Terraform configuration roughly resembles:
dns/ (Terraform module)
dev/ (makes use of module)
prod/ (makes use of module)

The approach I'm currently using is to have two providers.
I have a master IAM user which can assume a role in the sub-accounts.
That way, in one Terraform configuration, I can target some resources at the root account's provider alias and others at the sub-account's alias.
This allows some sharing of state between multiple accounts within one Terraform module.
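A minimal sketch of what that can look like for this case (the region, account IDs, and role ARNs are placeholders, not values from the question):

# Provider for the prod (parent) account
provider "aws" {
  alias  = "prod"
  region = "eu-west-1" # assumed region

  assume_role {
    role_arn = "arn:aws:iam::111111111111:role/terraform" # hypothetical prod role
  }
}

# Provider for the dev sub-account
provider "aws" {
  alias  = "dev"
  region = "eu-west-1"

  assume_role {
    role_arn = "arn:aws:iam::222222222222:role/terraform" # hypothetical dev role
  }
}

# Parent zone lives in the prod account
resource "aws_route53_zone" "top_level_domain" {
  provider = aws.prod
  name     = "example.com"
}

# Subdomain zone lives in the dev account
resource "aws_route53_zone" "subdomain" {
  provider = aws.dev
  name     = "dev.example.com"
}

# The delegation record in the prod zone can reference the dev zone's
# name servers directly, because both zones live in the same state
resource "aws_route53_record" "subdomain_delegation" {
  provider = aws.prod
  zone_id  = aws_route53_zone.top_level_domain.zone_id
  name     = "dev.example.com"
  type     = "NS"
  ttl      = 300
  records  = aws_route53_zone.subdomain.name_servers
}

With this layout the count-based environment switch is no longer needed for the zones, since a single configuration manages both accounts.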

Related

Configure High Availability conditionally based on variable for `azurerm_postgresql_flexible_server`

I'm configuring my servers with terraform. For non-prod environments, our sku doesn't allow for high availability, but in prod our sku does.
For some reason high_availability.mode only accepts the value of "ZoneRedundant" for high availability, but doesn't accept any other value (according to the documentation). Depending on whether or not var.isProd is true, I want to turn high availability on and off, but how would I do that?
resource "azurerm_postgresql_flexible_server" "default" {
name = "example-${var.env}-postgresql-server"
location = azurerm_resource_group.default.location
resource_group_name = azurerm_resource_group.default.name
version = "14"
administrator_login = "sqladmin"
administrator_password = random_password.postgresql_server.result
geo_redundant_backup_enabled = var.isProd
backup_retention_days = var.isProd ? 60 : 7
storage_mb = 32768
high_availability {
mode = "ZoneRedundant"
}
sku_name = var.isProd ? "B_Standard_B2s" : "B_Standard_B1ms"
}
I believe the default for this resource is HA disabled, so it is not the mode argument that manages HA but rather the presence of the high_availability block. You can therefore manage HA by omitting the block to accept the default of "disabled", or by including the block with a mode of ZoneRedundant to enable it:
dynamic "high_availability" {
for_each = var.isProd ? ["this"] : []
content {
mode = "ZoneRedundant"
}
}
I am hypothesizing somewhat on the API endpoint parameter defaults, so this would need to be acceptance tested with an Azure account. However, the documentation for the Azure Postgres Flexible Server in general claims HA is in fact disabled by default, so this should function as desired.
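Embedded in the resource from the question, that would look like:

resource "azurerm_postgresql_flexible_server" "default" {
  name                         = "example-${var.env}-postgresql-server"
  location                     = azurerm_resource_group.default.location
  resource_group_name          = azurerm_resource_group.default.name
  version                      = "14"
  administrator_login          = "sqladmin"
  administrator_password       = random_password.postgresql_server.result
  geo_redundant_backup_enabled = var.isProd
  backup_retention_days        = var.isProd ? 60 : 7
  storage_mb                   = 32768
  sku_name                     = var.isProd ? "B_Standard_B2s" : "B_Standard_B1ms"

  # HA block only rendered for prod; omitted entirely otherwise
  dynamic "high_availability" {
    for_each = var.isProd ? ["this"] : []
    content {
      mode = "ZoneRedundant"
    }
  }
}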
If you deploy a flexible server using the azurerm provider in Terraform, it accepts only the ZoneRedundant value for high_availability.mode, and with that provider you can only deploy PostgreSQL versions 11, 12 and 13.
Using the AzAPI provider in Terraform, you can set highAvailability.mode to any of Disabled, ZoneRedundant or SameZone.
Based on your requirement I have created the sample Terraform script below, which has an environment variable that accepts only prod or non-prod values; based on this value the flexible server is deployed with the corresponding properties.
If the environment value is prod, the script deploys the flexible server with high availability set to ZoneRedundant, backup retention of 35 days, and geo-redundant backup enabled.
If the environment value is non-prod, the script deploys the flexible server with high availability disabled, backup retention of 7 days, and geo-redundant backup disabled.
Here is the Terraform Script:
terraform {
  required_providers {
    azapi = {
      source = "Azure/azapi"
    }
  }
}

provider "azapi" {
}

variable "environment" {
  type = string
  validation {
    condition     = anytrue([var.environment == "prod", var.environment == "non-prod"])
    error_message = "You haven't defined any of the allowed values."
  }
}

resource "azapi_resource" "rg" {
  type      = "Microsoft.Resources/resourceGroups@2021-04-01"
  name      = "teststackhub"
  location  = "eastus"
  parent_id = "/subscriptions/<subscriptionId>"
}

resource "azapi_resource" "test" {
  type      = "Microsoft.DBforPostgreSQL/flexibleServers@2022-01-20-preview"
  name      = "example-${var.environment}-postgresql-server"
  location  = azapi_resource.rg.location
  parent_id = azapi_resource.rg.id

  body = jsonencode({
    properties = {
      administratorLogin         = "azureuser"
      administratorLoginPassword = "<password>"
      backup = {
        backupRetentionDays = var.environment == "prod" ? 35 : 7
        geoRedundantBackup  = var.environment == "prod" ? "Enabled" : "Disabled"
      }
      storage = {
        storageSizeGB = 32
      }
      highAvailability = {
        mode = var.environment == "prod" ? "ZoneRedundant" : "Disabled"
      }
      version = "14"
    }
    sku = {
      name = var.environment == "prod" ? "Standard_B2s" : "Standard_B1ms"
      tier = "Burstable" # B-series SKUs belong to the Burstable tier
    }
  })
}
NOTE: The above sample Terraform script is for your reference; please adjust it to your business requirements.

Module enabling to deploy based on a specific variable (Terraform/Datadog)

I'm looking to automate one specific part of a very complicated Terraform script that I have.
To make it a bit clearer: I have created a TF template that deploys the entire infrastructure into Azure, with App Services, a storage account, security groups, Windows-based VMs, and Linux-based VMs split between MongoDB and RabbitMQ. Inside my script I was able to automate the deployment so that it uses the name of the application to create a Synthetic Test, and picks a specific Datadog key based on the environment using a local variable:
keyTouse = lower(var.environment) != "production" ? var.DatadogNPD : var.DatadogPRD
Right now the point that bothers me is the following.
Since we do not need Synthetic Tests in non-production environments, I would like to use some sort of logic to not deploy Synthetic Tests if var.environment is not "PRODUCTION".
To make this part more interesting, I also deploy multiple Synthetic Tests using "count" and "length", as shown below.
Inside main.tf:
module "Datadog" {
source = "./Datadog"
webapp_name = [ azurerm_linux_web_app.service1.name, azurerm_linux_web_app.service2.name ]
}
and for the Datadog Synthetic Test:
resource "datadog_synthetics_test" "app_service_monitoring" {
count = length(var.webapp_name)
type = "api"
subtype = "http"
request_definition {
method = "GET"
url = "https://${element(var.webapp_name, count.index)}.azurewebsites.net/health"
}
Could you help me and suggest how can I enable or disable modules deployment using the variable based on environment?
Based on my understanding of the question, the change would have to be two-fold:
Add an environment variable to the module code
Use that variable for deciding if the synthetics test resource should be created or not
The above translates to creating another variable in the module and, later on, providing a value for that variable when calling the module. The last part is deciding, based on that value, whether the resource gets created.
# module-level variable
variable "environment" {
  type        = string
  description = "Environment in which to deploy resources."
}
Then, in the resource, you would add the following:
resource "datadog_synthetics_test" "app_service_monitoring" {
count = var.environment == "production" ? length(var.webapp_name) : 0
type = "api"
subtype = "http"
request_definition {
method = "GET"
url = "https://${element(var.webapp_name, count.index)}.azurewebsites.net/health"
}
}
And finally, in the root module:
module "Datadog" {
source = "./Datadog"
webapp_name = [ azurerm_linux_web_app.service1.name, azurerm_linux_web_app.service2.name ]
environment = var.environment
}
The environment = var.environment line will work if you have also defined the variable environment in the root module. If not, you can always set it to a value you want:
module "Datadog" {
source = "./Datadog"
webapp_name = [ azurerm_linux_web_app.service1.name, azurerm_linux_web_app.service2.name ]
environment = "dev" # <--- or "production" or any other environment you have
}
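Alternatively, on Terraform 0.13 or newer you can put count on the module call itself and skip creating the module entirely outside production (a sketch using the same variable; the module then needs no environment logic inside it):

module "Datadog" {
  count       = var.environment == "production" ? 1 : 0
  source      = "./Datadog"
  webapp_name = [azurerm_linux_web_app.service1.name, azurerm_linux_web_app.service2.name]
}

Note that any reference to the module's outputs then needs an index, e.g. module.Datadog[0].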

How do I implement a retry pattern with Terraform?

My use case: I need to create an AKS cluster with Terraform azurerm provider, and then set up a Network Watcher flow log for its NSG.
Note that, like many other AKS resources, the corresponding NSG is not controlled by Terraform. Instead, it's created by Azure indirectly (and asynchronously), so I treat it as data, not resource.
Also note that Azure will create and use its own NSG even if the AKS cluster is created with a custom-created VNet.
Depending on the particular region and the Azure API gateway, my team has seen up to 40 minute delay between having the AKS created and then the NSG resource visible in the node pool resource group.
If I don't want my Terraform config to fail, I see 3 options:
1. Run a CLI script that waits for the NSG, make it a null_resource, and depend on it
2. Implement the same with a custom provider
3. Have a really ugly workaround that implements a retry pattern - below is 10 attempts at 30 seconds each:
data "azurerm_resources" "my_nsg_1" {
resource_group_name = var.clusterNodeResourceGroup
type = "Microsoft.Network/networkSecurityGroups"
}
resource "time_sleep" "my_nsg_sleep1" {
count = length(data.azurerm_resources.my_nsg_1.resources) == 0 ? 1 : 0
create_duration = "30s"
triggers = {
ts = timestamp()
}
}
data "azurerm_resources" "my_nsg_2" {
depends_on = [time_sleep.my_nsg_sleep1]
resource_group_name = var.clusterNodeResourceGroup
type = "Microsoft.Network/networkSecurityGroups"
}
resource "time_sleep" "my_nsg_sleep2" {
count = length(data.azurerm_resources.my_nsg_1.resources) == 0 ? 1 : 0
create_duration = length(data.azurerm_resources.my_nsg_2.resources) == 0 ? "30s" : "0s"
triggers = {
ts = timestamp()
}
}
...
data "azurerm_resources" "my_nsg_11" {
depends_on = [time_sleep.my_nsg_sleep10]
resource_group_name = var.clusterNodeResourceGroup
type = "Microsoft.Network/networkSecurityGroups"
}
// Now azurerm_resources.my_nsg_11 is OK as long as the NSG was created and became visible to the current API Gateway within 5 minutes.
Note that Terraform doesn't allow resource repeating via the use of "for_each" or "count" at more than an individual resource level. In addition, because it resolves dependencies during the static phase, two sets of resource lists created with "count" or "for_each" cannot have dependencies at an individual element level of each other - you can only have one list depend on the other, obviously with no circular dependencies allowed.
E.g. my_nsg[count.index] cannot depend on my_nsg_delay[count.index-1] while my_nsg_delay[count.index] depends on my_nsg[count.index]
Hence this horrible non-DRY antipattern.
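For reference, option 1 would look roughly like the following (a sketch only; the az CLI polling loop, interval, and attempt count are assumptions, not something I have in place):

resource "null_resource" "wait_for_nsg" {
  provisioner "local-exec" {
    # Poll until at least one NSG appears in the node pool resource group (up to ~10 minutes)
    command = <<-EOT
      for i in $(seq 1 40); do
        count=$(az network nsg list --resource-group "${var.clusterNodeResourceGroup}" --query "length(@)" -o tsv)
        [ "$count" -gt 0 ] && exit 0
        sleep 15
      done
      echo "Timed out waiting for NSG" >&2
      exit 1
    EOT
  }
}

data "azurerm_resources" "my_nsg" {
  depends_on          = [null_resource.wait_for_nsg]
  resource_group_name = var.clusterNodeResourceGroup
  type                = "Microsoft.Network/networkSecurityGroups"
}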
Is there a better declarative solution so I don't involve a custom provider or a script?

MySQL firewall rules from an app service's outbound IPs on Azure using Terraform

I'm using Terraform to deploy an app to Azure, including a MySQL server and an App Service, and want to restrict database access to only the app service. The app service has a list of outbound IPs, so I think I need to create firewall rules for these on the database. I've found that in Terraform, I can't use count or for_each to dynamically create these rules, as the value isn't known in advance.
We've also considered hard coding the count but the Azure docs don't confirm the number of IPs. With this, and after seeing different numbers in stackoverflow comments, I'm worried that the number could change at some point and break future deployments.
The output error suggests using -target as a workaround, but the Terraform docs explicitly advise against this due to potential risks.
Any suggestions for a solution? Is there a workaround, or is there another approach that would be better suited?
Non-functional code I'm using so far to give a better idea of what I'm trying to do:
...
locals {
  appIps = split(",", azurerm_app_service.appService.outbound_ip_addresses)
}

resource "azurerm_mysql_firewall_rule" "appFirewallRule" {
  count      = length(local.appIps)
  depends_on = [azurerm_app_service.appService]

  name                = "appService-${count.index}"
  resource_group_name = "myResourceGroup"
  server_name         = azurerm_mysql_server.databaseServer.name
  start_ip_address    = local.appIps[count.index]
  end_ip_address      = local.appIps[count.index]
}
...
This returns the error:
Error: Invalid count argument
on main.tf line 331, in resource "azurerm_mysql_firewall_rule" "appFirewallRule":
331: count = length(local.appIps)
The "count" value depends on resource attributes that cannot be determined
until apply, so Terraform cannot predict how many instances will be created.
To work around this, use the -target argument to first apply only the
resources that the count depends on.
I dug deeper and I think I have a solution that works, at least for me :)
The core of the problem here is the need to do the whole thing in 2 steps (we can't use not-yet-known values as arguments for count and for_each). That can be solved with explicit imperative logic or actions (like using -target once, or commenting out and then uncommenting), but besides not being declarative, that is also not suitable for automation via CI/CD (I am using Terraform Cloud, not a local environment).
So I am doing it with just Terraform resources; the only "imperative" part is triggering the pipeline (or local run) twice.
Check my snippet:
data "azurerm_resources" "web_apps_filter" {
resource_group_name = var.rg_system_name
type = "Microsoft.Web/sites"
required_tags = {
ProvisionedWith = "Terraform"
}
}
data "azurerm_app_service" "web_apps" {
count = length(data.azurerm_resources.web_apps_filter.resources)
resource_group_name = var.rg_system_name
name = data.azurerm_resources.web_apps_filter.resources[count.index].name
}
data "azurerm_resources" "func_apps_filter" {
resource_group_name = var.rg_storage_name
type = "Microsoft.Web/sites"
required_tags = {
ProvisionedWith = "Terraform"
}
}
data "azurerm_app_service" "func_apps" {
count = length(data.azurerm_resources.func_apps_filter.resources)
resource_group_name = var.rg_storage_name
name = data.azurerm_resources.func_apps_filter.resources[count.index].name
}
locals {
# flatten ensures that this local value is a flat list of IPs, rather
# than a list of lists of IPs.
# distinct ensures that we have only uniq IPs
web_ips = distinct(flatten([
for app in data.azurerm_app_service.web_apps : [
split(",", app.possible_outbound_ip_addresses)
]
]))
func_ips = distinct(flatten([
for app in data.azurerm_app_service.func_apps : [
split(",", app.possible_outbound_ip_addresses)
]
]))
}
resource "azurerm_postgresql_firewall_rule" "pgfr_func" {
for_each = toset(local.web_ips)
name = "web_app_ip_${replace(each.value, ".", "_")}"
resource_group_name = var.rg_storage_name
server_name = "${var.project_abbrev}-pgdb-${local.region_abbrev}-${local.environment_abbrev}"
start_ip_address = each.value
end_ip_address = each.value
}
resource "azurerm_postgresql_firewall_rule" "pgfr_web" {
for_each = toset(local.func_ips)
name = "func_app_ip_${replace(each.value, ".", "_")}"
resource_group_name = var.rg_storage_name
server_name = "${var.project_abbrev}-pgdb-${local.region_abbrev}-${local.environment_abbrev}"
start_ip_address = each.value
end_ip_address = each.value
}
The most important piece there is the azurerm_resources data source - I am using it to filter for the web apps that already exist in my resource group (and are managed by automation). I create the DB firewall rules from that list; on the next Terraform run, once a newly created web app exists, it will be whitelisted as well.
Another interesting detail is the de-duplication of the IPs - a lot of them are duplicated.
At the moment, using -target as a workaround is the better choice, because with how Terraform currently works it considers this sort of configuration incorrect. Using resources' computed outputs as arguments to count and for_each is not recommended; using variables or derived local values that are known at plan time is the preferred approach. If you choose to go ahead with computed values for count/for_each, this will sometimes require you to work around it using -target as illustrated above.
Besides, the bug will be fixed in the pre-release 0.14 code.

Why isn't my AWS ACM certificate validating?

I have a domain name registered in AWS Route53 with an ACM certificate. I am now attempting to both move that domain name and certificate to a new account as well as manage the resources with Terraform. I used the AWS CLI to move the domain name to the new account and it appears to have worked fine. Then I tried running this Terraform code to create a new certificate and hosted zone for the domain.
resource "aws_acm_certificate" "default" {
domain_name = "mydomain.io"
validation_method = "DNS"
}
resource "aws_route53_zone" "external" {
name = "mydomain.io"
}
resource "aws_route53_record" "validation" {
name = aws_acm_certificate.default.domain_validation_options.0.resource_record_name
type = aws_acm_certificate.default.domain_validation_options.0.resource_record_type
zone_id = aws_route53_zone.external.zone_id
records = [aws_acm_certificate.default.domain_validation_options.0.resource_record_value]
ttl = "60"
}
resource "aws_acm_certificate_validation" "default" {
certificate_arn = aws_acm_certificate.default.arn
validation_record_fqdns = [
aws_route53_record.validation.fqdn,
]
}
There are two things that are strange about this. Primarily, the certificate is created but the validation never completes; it's still in Pending validation status. I read somewhere after this failed that you can't auto-validate and you need to create the CNAME record manually. So I went into the console and clicked the "add CNAME to Route 53" button. This added the CNAME record appropriately to the new Route 53 hosted zone that Terraform created. But it's been pending for hours. I've clicked that same button several times; only one CNAME was created, and subsequent clicks have no effect.
Another oddity, and perhaps a clue, is that my website is still up and working. I believe this should have broken the website, since the domain is now owned by a new account, routing to a different hosted zone on that new account, and has a certificate that's still pending. However, everything still works as normal. So I think it's possible that the old certificate and hosted zone are affecting this. Do they need to release the domain, and do I need to delete that certificate? Deleting the certificate on the old account sounds unnecessary; it should just no longer be handed out.
I have not yet associated the certificate with CloudFront or an ALB, which I intend to do. But since it's not validated, my Terraform code for creating a CloudFront distribution fails.
It turns out that my transferred domain came over with a set of name servers; however, the name servers in the Route 53 hosted zone were all different. When these are created together through the console, it does the right thing. I'm not sure how to do the right thing here with Terraform, which I'm going to post another question about in a moment. But for now, the solution is to change the name servers on either the hosted zone or the registered domain so that they match each other.
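One way I could imagine keeping them in sync from Terraform (a sketch, assuming a recent AWS provider that ships the aws_route53domains_registered_domain resource) is to point the registered domain at the new hosted zone's name servers:

# Update the registered domain's name servers to match the new hosted zone
resource "aws_route53domains_registered_domain" "main" {
  domain_name = "mydomain.io"

  dynamic "name_server" {
    for_each = aws_route53_zone.external.name_servers
    content {
      name = name_server.value
    }
  }
}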
It's working for me:
data "aws_route53_zone" "main" {
name = var.domain
private_zone = false
}
locals {
final_domain = var.wildcard_enable == true ? *.var.domain : var.domain
# final_domain = "${var.wildcard_enable == true ? "*.${var.domain}" : var.domain}"
}
resource "aws_acm_certificate" "cert" {
domain_name = local.final_domain
validation_method = "DNS"
tags = {
"Name" = var.domain
}
lifecycle {
create_before_destroy = true
}
}
resource "aws_route53_record" "cert_validation" {
depends_on = [aws_acm_certificate.cert]
zone_id = data.aws_route53_zone.main.id
name = sort(aws_acm_certificate.cert.domain_validation_options[*].resource_record_name)[0]
type = "CNAME"
ttl = "300"
records = [sort(aws_acm_certificate.cert.domain_validation_options[*].resource_record_value)[0]]
allow_overwrite = true
}
resource "aws_acm_certificate_validation" "cert" {
certificate_arn = aws_acm_certificate.cert.arn
validation_record_fqdns = [
aws_route53_record.cert_validation.fqdn
]
timeouts {
create = "60m"
}
}
