How do I implement a retry pattern with Terraform?

My use case: I need to create an AKS cluster with Terraform azurerm provider, and then set up a Network Watcher flow log for its NSG.
Note that, as with many other AKS-managed resources, the corresponding NSG is not controlled by Terraform. Instead, it's created by Azure indirectly (and asynchronously), so I treat it as data, not a resource.
Also note that Azure will create and use its own NSG even if the AKS cluster is created with a custom-created VNet.
Depending on the particular region and the Azure API gateway, my team has seen up to a 40-minute delay between the AKS cluster being created and the NSG resource becoming visible in the node pool resource group.
If I don't want my Terraform config to fail, I see 3 options:
Run a CLI script that waits for the NSG, make it a null_resource, and depend on it (see the sketch after the question)
Implement the same with a custom provider
Use a really ugly workaround that implements a retry pattern - below are 10 attempts at 30 seconds each:
data "azurerm_resources" "my_nsg_1" {
resource_group_name = var.clusterNodeResourceGroup
type = "Microsoft.Network/networkSecurityGroups"
}
resource "time_sleep" "my_nsg_sleep1" {
count = length(data.azurerm_resources.my_nsg_1.resources) == 0 ? 1 : 0
create_duration = "30s"
triggers = {
ts = timestamp()
}
}
data "azurerm_resources" "my_nsg_2" {
depends_on = [time_sleep.my_nsg_sleep1]
resource_group_name = var.clusterNodeResourceGroup
type = "Microsoft.Network/networkSecurityGroups"
}
resource "time_sleep" "my_nsg_sleep2" {
count = length(data.azurerm_resources.my_nsg_1.resources) == 0 ? 1 : 0
create_duration = length(data.azurerm_resources.my_nsg_2.resources) == 0 ? "30s" : "0s"
triggers = {
ts = timestamp()
}
}
...
data "azurerm_resources" "my_nsg_11" {
depends_on = [time_sleep.my_nsg_sleep10]
resource_group_name = var.clusterNodeResourceGroup
type = "Microsoft.Network/networkSecurityGroups"
}
// Now azurerm_resources.my_nsg_11 is OK as long as the NSG was created and became visible to the current API Gateway within 5 minutes.
Note that Terraform doesn't allow repeating resources via "for_each" or "count" at anything above the individual resource level. In addition, because it resolves dependencies during the static phase, two sets of resources created with "count" or "for_each" cannot have dependencies between individual elements of each other - you can only have one whole list depend on the other, with no circular dependencies allowed.
E.g. my_nsg[count.index] cannot depend on my_nsg_delay[count.index-1] while my_nsg_delay[count.index] depends on my_nsg[count.index].
Hence this horrible non-DRY antipattern.
Is there a better declarative solution so I don't involve a custom provider or a script?
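For completeness, here is a minimal sketch of what option 1 could look like - a null_resource that polls for the NSG via the Azure CLI. This is illustrative only: it assumes the az CLI is authenticated wherever Terraform runs, and the AKS resource name azurerm_kubernetes_cluster.aks is hypothetical.
resource "null_resource" "wait_for_nsg" {
  triggers = {
    # Re-run the wait whenever the cluster is recreated (hypothetical resource name)
    cluster_id = azurerm_kubernetes_cluster.aks.id
  }

  provisioner "local-exec" {
    interpreter = ["/bin/bash", "-c"]
    command     = <<-EOT
      # Poll for up to 80 x 30s = 40 minutes, matching the worst delay we've seen
      for i in $(seq 1 80); do
        count=$(az network nsg list \
          --resource-group "${var.clusterNodeResourceGroup}" \
          --query "length(@)" --output tsv)
        [ "$count" -gt 0 ] && exit 0
        sleep 30
      done
      echo "Timed out waiting for the NSG" >&2
      exit 1
    EOT
  }
}

data "azurerm_resources" "my_nsg" {
  depends_on          = [null_resource.wait_for_nsg]
  resource_group_name = var.clusterNodeResourceGroup
  type                = "Microsoft.Network/networkSecurityGroups"
}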

Related

Issue provisioning Databricks workspace resources using Terraform

I have defined a resource to provision Databricks workspaces on Azure using Terraform as follows; it consumes a list of inputs from a tfvars file (one entry per workspace) and provisions them.
resource "azurerm_databricks_workspace" "workspace" {
for_each = { for r in var.databricks_workspace_list : r.workspace_nm => r}
name = each.key
resource_group_name = each.value.resource_group_name
location = each.value.location
sku = "standard"
tags = {
Environment = "Dev"
}
}
I am trying to create an additional resource as below:
resource "databricks_instance_pool" "smallest_nodes" {
instance_pool_name = "Smallest Nodes"
min_idle_instances = 0
max_capacity = 300
node_type_id = data.databricks_node_type.smallest.id // data block is defined
idle_instance_autotermination_minutes = 10
}
To create the instance pool, I need to pass the workspace id in the databricks provider block as below:
provider "databricks" {
azure_client_id= *******
azure_client_secret= *******
azure_tenant_id= *******
azure_workspace_resource_id = azurerm_databricks_workspace.workspace.id
}
But when I run terraform plan, it fails with the error below:
Error: Missing resource instance key
  azure_workspace_resource_id = azurerm_databricks_workspace.workspace.id
Because azurerm_databricks_workspace.workspace has "for_each" set, its attributes must be accessed on specific instances.
For example, to correlate with indices, use:
  azurerm_databricks_workspace.workspace[each.key]
I couldn't use for_each in the provider block, and I haven't been able to find a way to index the workspace id there either.
Appreciate your inputs.
TF version: 0.13
AzureRM: 3.10.0
Databricks: 0.5.7
The problem is that you create multiple workspaces when you use for_each in the azurerm_databricks_workspace resource, but your provider block is trying to refer to a "generic" resource instance, so Terraform complains.
The solution here would be either:
Remove for_each if you're creating just one workspace
Instead of azurerm_databricks_workspace.workspace.id, refer to azurerm_databricks_workspace.workspace[<name>].id, where <name> is the key of the specific Databricks instance from the list of workspaces
P.S. Your databricks_instance_pool resource doesn't have an explicit depends_on, so the operation will fail with an authentication error as described here.
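For illustration, the second option could look like this minimal sketch. The workspace key "my-workspace" and the credential variable names are illustrative (the key comes from workspace_nm in your tfvars), and the explicit depends_on addresses the P.S. above:
provider "databricks" {
  azure_client_id             = var.client_id
  azure_client_secret         = var.client_secret
  azure_tenant_id             = var.tenant_id
  # Index the for_each resource by one specific key
  azure_workspace_resource_id = azurerm_databricks_workspace.workspace["my-workspace"].id
}

resource "databricks_instance_pool" "smallest_nodes" {
  # Explicit dependency so the pool isn't created before the workspace exists
  depends_on                            = [azurerm_databricks_workspace.workspace]
  instance_pool_name                    = "Smallest Nodes"
  min_idle_instances                    = 0
  max_capacity                          = 300
  node_type_id                          = data.databricks_node_type.smallest.id
  idle_instance_autotermination_minutes = 10
}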

Terraform: Azure VMSS rolling_upgrade does not re-image instances

I have the following VMSS Terraform config:
resource "azurerm_linux_virtual_machine_scale_set" "my-vmss" {
...
instances = 2
...
upgrade_mode = "Rolling"
rolling_upgrade_policy {
max_batch_instance_percent = 100
max_unhealthy_instance_percent = 100
max_unhealthy_upgraded_instance_percent = 0
pause_time_between_batches = "PT10M"
}
extension {
name = "my-vmss-app-health-ext"
publisher = "Microsoft.ManagedServices"
type = "ApplicationHealthLinux"
automatic_upgrade_enabled = true
type_handler_version = "1.0"
settings =jsonencode({
protocol = "tcp"
port = 8080
})
...
}
However, whenever a change is applied (e.g., changing custom_data), the VMSS is updated but the instances are not reimaged. Only after a manual reimage (via the UI or the Azure CLI) do the instances get updated.
The terraform plan output is as expected - the custom_data change is detected:
  # azurerm_linux_virtual_machine_scale_set.my-vmss will be updated in-place
  ~ resource "azurerm_linux_virtual_machine_scale_set" "my-vmss" {
        ...
      ~ custom_data = (sensitive value)
        ...
    }
Plan: 0 to add, 1 to change, 0 to destroy.
Any idea how to make Terraform trigger the instance reimaging?
It looks like this is not a Terraform issue but part of the "rolling upgrades" design in Azure. From here (1) it follows that updates to custom_data won't affect existing instances. I.e., until an instance is manually reimaged (e.g., via the UI or the Azure CLI) it won't get the new custom_data (e.g., the new cloud-init script).
In contrast, AWS does refresh instances on custom_data updates. Please let me know if my understanding is incorrect or if you have an idea of how to work around this limitation in Azure.
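One possible workaround, sketched below, is to script the manual reimage step from Terraform itself: a null_resource keyed on a hash of the custom_data runs az vmss reimage whenever that hash changes. This is only a sketch under assumptions - the az CLI must be authenticated where Terraform runs, and var.custom_data / var.resource_group_name are illustrative names:
resource "null_resource" "reimage_on_custom_data_change" {
  # Re-run whenever the custom_data content changes (hypothetical variable name)
  triggers = {
    custom_data_hash = sha256(var.custom_data)
  }

  provisioner "local-exec" {
    # Reimage the scale set instances; they roll according to the upgrade policy
    command = "az vmss reimage --resource-group ${var.resource_group_name} --name ${azurerm_linux_virtual_machine_scale_set.my-vmss.name}"
  }

  depends_on = [azurerm_linux_virtual_machine_scale_set.my-vmss]
}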

MySQL firewall rules from an app service's outbound IPs on Azure using Terraform

I'm using Terraform to deploy an app to Azure, including a MySQL server and an App Service, and want to restrict database access to only the app service. The app service has a list of outbound IPs, so I think I need to create firewall rules for these on the database. I've found that in Terraform, I can't use count or for_each to dynamically create these rules, as the value isn't known in advance.
We've also considered hard-coding the count, but the Azure docs don't confirm the number of IPs. Given this, and after seeing different numbers in Stack Overflow comments, I'm worried the number could change at some point and break future deployments.
The error output suggests using -target as a workaround, but the Terraform docs explicitly advise against this due to potential risks.
Any suggestions for a solution? Is there a workaround, or is there another approach that would be better suited?
Non-functional code I'm using so far to give a better idea of what I'm trying to do:
...
locals {
  appIps = split(",", azurerm_app_service.appService.outbound_ip_addresses)
}
resource "azurerm_mysql_firewall_rule" "appFirewallRule" {
  count               = length(local.appIps)
  depends_on          = [azurerm_app_service.appService]
  name                = "appService-${count.index}"
  resource_group_name = "myResourceGroup"
  server_name         = azurerm_mysql_server.databaseServer.name
  start_ip_address    = local.appIps[count.index]
  end_ip_address      = local.appIps[count.index]
}
...
This returns the error:
Error: Invalid count argument
on main.tf line 331, in resource "azurerm_mysql_firewall_rule" "appFirewallRule":
331: count = length(local.appIps)
The "count" value depends on resource attributes that cannot be determined
until apply, so Terraform cannot predict how many instances will be created.
To work around this, use the -target argument to first apply only the
resources that the count depends on.
I dug deeper and I think I have a solution which works, at least for me. :)
The core of the problem is the necessity to do the whole thing in 2 steps (we can't have not-yet-known values as arguments to count or for_each). It can be solved with explicit imperative logic or actions (like using -target once, or commenting out and then uncommenting). Besides not being declarative, that's also not suitable for automation via CI/CD (I am using Terraform Cloud, not a local environment).
So I am doing it just with Terraform resources; the only "imperative" pattern is to trigger the pipeline (or local run) twice.
Check my snippet:
data "azurerm_resources" "web_apps_filter" {
resource_group_name = var.rg_system_name
type = "Microsoft.Web/sites"
required_tags = {
ProvisionedWith = "Terraform"
}
}
data "azurerm_app_service" "web_apps" {
count = length(data.azurerm_resources.web_apps_filter.resources)
resource_group_name = var.rg_system_name
name = data.azurerm_resources.web_apps_filter.resources[count.index].name
}
data "azurerm_resources" "func_apps_filter" {
resource_group_name = var.rg_storage_name
type = "Microsoft.Web/sites"
required_tags = {
ProvisionedWith = "Terraform"
}
}
data "azurerm_app_service" "func_apps" {
count = length(data.azurerm_resources.func_apps_filter.resources)
resource_group_name = var.rg_storage_name
name = data.azurerm_resources.func_apps_filter.resources[count.index].name
}
locals {
# flatten ensures that this local value is a flat list of IPs, rather
# than a list of lists of IPs.
# distinct ensures that we have only uniq IPs
web_ips = distinct(flatten([
for app in data.azurerm_app_service.web_apps : [
split(",", app.possible_outbound_ip_addresses)
]
]))
func_ips = distinct(flatten([
for app in data.azurerm_app_service.func_apps : [
split(",", app.possible_outbound_ip_addresses)
]
]))
}
resource "azurerm_postgresql_firewall_rule" "pgfr_func" {
for_each = toset(local.web_ips)
name = "web_app_ip_${replace(each.value, ".", "_")}"
resource_group_name = var.rg_storage_name
server_name = "${var.project_abbrev}-pgdb-${local.region_abbrev}-${local.environment_abbrev}"
start_ip_address = each.value
end_ip_address = each.value
}
resource "azurerm_postgresql_firewall_rule" "pgfr_web" {
for_each = toset(local.func_ips)
name = "func_app_ip_${replace(each.value, ".", "_")}"
resource_group_name = var.rg_storage_name
server_name = "${var.project_abbrev}-pgdb-${local.region_abbrev}-${local.environment_abbrev}"
start_ip_address = each.value
end_ip_address = each.value
}
The most important piece there is the azurerm_resources data source - I use it to filter which web apps already exist in my resource group (and are managed by automation). I create the DB firewall rules from that list; on the next terraform run, when a newly created web app is there, it will be whitelisted as well.
Another interesting detail is the deduplication of IPs - a lot of them are duplicated.
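As a concrete illustration of that dedup step (with hypothetical IPs):
locals {
  # flatten merges the per-app lists; distinct drops the repeated "2.2.2.2"
  example_ips = distinct(flatten([["1.1.1.1", "2.2.2.2"], ["2.2.2.2", "3.3.3.3"]]))
  # => ["1.1.1.1", "2.2.2.2", "3.3.3.3"]
}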
At the moment, using -target as a workaround is the better choice, because with how Terraform works at present, it considers this sort of configuration to be incorrect. Using resource computed outputs as arguments to count and for_each is not recommended; using variables or derived local values which are known at plan time is the preferred approach. If you choose to go ahead with using computed values for count/for_each, this will sometimes require you to work around it using -target as illustrated above. For more details, please refer to here.
Besides, the bug will be fixed in the pre-release 0.14 code.
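For illustration, the -target workaround mentioned above boils down to two applies (the resource address is taken from the question):
# First apply only the resource the count depends on, then apply everything else
terraform apply -target=azurerm_app_service.appService
terraform apply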

How to add resource dependencies in terraform

I have created a GCP Kubernetes cluster using Terraform and configured a few Kubernetes resources, such as namespaces and Helm releases. I would like Terraform to automatically destroy/recreate all the Kubernetes cluster resources if the GCP cluster is destroyed/recreated, but I can't seem to figure out how to do it.
The behavior I am trying to recreate is similar to what you would get if you used triggers with null_resources. Is this possible with normal resources?
resource "google_container_cluster" "primary" {
name = "marcellus-wallace"
location = "us-central1-a"
initial_node_count = 3
resource "kubernetes_namespace" "example" {
metadata {
annotations = {
name = "example-annotation"
}
labels = {
mylabel = "label-value"
}
name = "terraform-example-namespace"
#Something like this, but this only works with null_resources
triggers {
cluster_id = "${google_container_cluster.primary.id}"
}
}
}
In your specific case, you don't need to specify any explicit dependencies: they will be set automatically, because you reference cluster_id = "${google_container_cluster.primary.id}" in your second resource.
In cases where you do need to set a manual dependency, you can use the depends_on meta-argument.
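A minimal sketch of that meta-argument, reusing the resources from the question:
resource "kubernetes_namespace" "example" {
  # Explicit dependency: the namespace is created only after the cluster exists
  depends_on = [google_container_cluster.primary]

  metadata {
    name = "terraform-example-namespace"
  }
}
Note that depends_on affects only the order of operations; it does not by itself recreate the namespace when the cluster is replaced.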

How to iterate multiple resources over the same list?

New to Terraform here. I'm trying to create multiple projects (in Google Cloud) using Terraform. The problem is that I have to execute multiple resources to completely set up a project. I tried count, but how can I tie multiple resources together sequentially using count? Here are the resources I need to execute per project:
Create the project using resource "google_project"
Enable the API service using resource "google_project_service"
Attach the service project to a host project using resource "google_compute_shared_vpc_service_project" (I'm using shared VPC)
This works if I want to create a single project. But if I pass a list of projects as input, how can I execute all the above resources for each project in that list sequentially?
E.g.
Input
project_list=["proj-1","proj-2"]
Execute the following sequentially:
resource "google-project" for "proj-1"
resource "google_project_service" for "proj-1"
resource "google_compute_shared_vpc_service_project" for "proj-1"
resource "google-project" for "proj-2"
resource "google_project_service" for "proj-2"
resource "google_compute_shared_vpc_service_project" for "proj-2"
I'm using Terraform version 0.11, which does not support for loops.
In Terraform, you can accomplish this using count and the two interpolation functions, element() and length().
First, you'll give your module an input variable:
variable "project_list" {
type = "list"
}
Then, you'll have something like:
resource "google_project" {
count = "${length(var.project_list)}"
name = "${element(var.project_list, count.index)}"
}
resource "google_project_service" {
count = "${length(var.project_list)}"
name = "${element(var.project_list, count.index)}"
}
resource "google_compute_shared_vpc_service_project" {
count = "${length(var.project_list)}"
name = "${element(var.project_list, count.index)}"
}
And of course you'll have your other configuration in those resource declarations as well.
Note that this pattern is described in Terraform Up and Running, Chapter 5, and there are other examples of using count.index in the docs here.
A small update to this question/answer (Terraform 0.13 and above): using count with length is no longer advisable here, due to the way Terraform tracks instances by index. Let's imagine the following scenario:
Suppose you have an array with 3 elements: project_list=["proj-1","proj-2","proj-3"]. Once you have applied it, if you delete the "proj-2" item from your array and run a plan, Terraform will modify your second instance to "proj-3" instead of removing it from the list (more info in this good post). The solution to get the proper behavior is to use the for_each meta-argument as follows:
variable "project_list" {
type = list(string)
}
resource "google_project" {
for_each = toset(var.project_list)
name = each.value
}
resource "google_project_service" {
for_each = toset(var.project_list)
name = each.value
}
resource "google_compute_shared_vpc_service_project" {
for_each = toset(var.project_list)
name = each.value
}
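With for_each, instances are tracked in state by key rather than by position, so removing an item only destroys its own instance:
# State addresses are keyed by value, not index:
#   google_project.project["proj-1"]
#   google_project.project["proj-3"]
# Removing "proj-2" from the list affects only google_project.project["proj-2"].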
Hope this helps! 👍
