Cannot create multiple Azure firewall rule sets concurrently in Terraform

My Terraform code is broadly architected like so:
module "firewall_hub" {
# This creates the Azure Firewall resource
source = "/path/to/module/a"
# attribute = value...
}
module "firewall_spoke" {
# This creates, amongst other things, firewall rule sets
source = "/path/to/module/b"
hub = module.firewall_hub
# attribute = value...
}
module "another_firewall_spoke" {
# This creates, amongst other things, firewall rule sets
source = "/path/to/module/c"
hub = module.firewall_hub
# attribute = value...
}
That is, the Azure Firewall resource is created in module.firewall_hub, which is then passed as an input to module.firewall_spoke and module.another_firewall_spoke; those modules create their own resources and inject firewall rule sets into the Firewall. Importantly, the rule sets are mutually exclusive between the spoke modules and designed so that their priorities don't overlap.
When I try to deploy this code (either apply or destroy), Azure throws an error:
Error: deleting Application Rule Collection "XXX" from Firewall "XXX (Resource Group "XXX"): network.AzureFirewallsClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status= Code="AnotherOperationInProgress" Message="Another operation on this or dependent resource is in progress. To retrieve status of the operation use uri: https://management.azure.com/subscriptions/XXX" Details=[]
My working hypothesis is that you cannot make multiple create/update/delete requests for firewall rule sets against the same firewall simultaneously, even if the rule sets are mutually exclusive. Indeed, if you wait a minute or so after the failed deployment and restart it -- without changing any Terraform code or manually updating resources in Azure -- it will happily carry on without error and complete successfully.
To test my assumption, I tried to work around this by forcing serialisation of the modules:
module "another_firewall_spoke" {
# This creates, amongst other things, firewall rule sets
source = "/path/to/module/c"
hub = module.firewall_hub
# attribute = value...
depends_on = [module.firewall_spoke]
}
However, unfortunately, this is not possible with the way my modules are written:
Providers cannot be configured within modules using count, for_each or depends_on.
Short of rewriting my modules (not an option), is it possible to get around this race condition -- if that's the problem -- or would you consider it a bug with the azurerm provider (i.e., it should recognise that API error response and wait its turn, up to some timeout)?
(Terraform v1.1.7, azurerm v2.96.0)

Following #silent's tip-off to this answer, I was able to resolve the race using the method described therein.
Something like this:
module "firewall_hub" {
# This creates the Azure Firewall resource
source = "/path/to/module/a"
# attribute = value...
}
module "firewall_spoke" {
# This creates, amongst other things, firewall rule sets
# Has an output "blockers" containing resources that cannot be deployed concurrently
source = "/path/to/module/b"
hub = module.firewall_hub
# attribute = value...
}
module "another_firewall_spoke" {
# This creates, amongst other things, firewall rule sets
source = "/path/to/module/c"
hub = module.firewall_hub
waits_for = module.firewall_spoke.blockers
# attribute = value...
}
So the trick is for your modules to export an output that contains a list of all the dependent resources that need to be deployed first. This can then be fed as an input to subsequent modules, which thread it through to the actual resources that require a depends_on value.
That is, in the depths of my module, resources have:
resource "some_resource" "foo" {
# attribute = value...
depends_on = [var.waits_for]
}
There are two important notes to bear in mind when using this method:
The waits_for variable in your module must have type any; list(any) doesn't work, as Terraform interprets this as a homogeneous list (which it most likely won't be).
Weirdly, imo, the depends_on clause requires you to explicitly use a list literal (i.e., [var.waits_for] rather than just var.waits_for), even if the variable you are threading through is a list. This doesn't type check in my head, but apparently Terraform is not only fine with it, but it expects it!
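For reference, a minimal sketch of what this might look like inside the modules, assuming the spoke modules create their rules with azurerm_firewall_application_rule_collection resources (the resource, output and variable names here are illustrative, not the actual module code):

# Inside module "b": export the resources that must finish deploying first
output "blockers" {
  value = [azurerm_firewall_application_rule_collection.rules]
}

# Inside module "c": accept an arbitrary value purely for its dependencies
variable "waits_for" {
  type    = any
  default = []
}

resource "azurerm_firewall_application_rule_collection" "rules" {
  # attribute = value...

  depends_on = [var.waits_for]
}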

Clarification on changes made outside of Terraform

I don't fully understand how Terraform handles external changes. Let's take an example:
resource "aws_instance" "ec2-test" {
ami = "ami-0d71ea30463e0ff8d"
instance_type = "t2.micro"
}
1: Security group modification
The default security group has been manually replaced by another one. Terraform detects the change:
❯ terraform plan --refresh-only
aws_instance.ec2-test: Refreshing state... [id=i-5297abcc6001ce9a8]

Note: Objects have changed outside of Terraform

Terraform detected the following changes made outside of Terraform since the last "terraform apply" which may have affected this plan:

  # aws_instance.ec2-test has changed
  ~ resource "aws_instance" "ec2-test" {
        id                     = "i-5297abcc6001ce9a8"
      ~ security_groups        = [
          - "default",
          + "test",
        ]
        tags                   = {}
      ~ vpc_security_group_ids = [
          + "sg-8231be9a95a4b1886",
          - "sg-f2fc3af19c4adefe0",
        ]
        # (28 unchanged attributes hidden)
        # (7 unchanged blocks hidden)
    }
No change planned:
❯ terraform plan
aws_instance.ec2-test: Refreshing state... [id=i-5297abcc6001ce9a8]
No changes. Your infrastructure matches the configuration.
Terraform has compared your real infrastructure against your configuration and found no differences, so no changes are needed.
It seems normal as we did not set the security_groups argument in the resource block (the desired state is aligned with the current state).
2: IAM instance profile added
An IAM role has been manually attached to the instance. Terraform also detects the change:
❯ terraform plan --refresh-only
aws_instance.ec2-test: Refreshing state... [id=i-5297abcc6001ce9a8]

Note: Objects have changed outside of Terraform

Terraform detected the following changes made outside of Terraform since the last "terraform apply" which may have affected this plan:

  # aws_instance.ec2-test has changed
  ~ resource "aws_instance" "ec2-test" {
      + iam_instance_profile = "test"
        id                   = "i-5297abcc6001ce9a8"
        tags                 = {}
        # (30 unchanged attributes hidden)
        # (7 unchanged blocks hidden)
    }

This is a refresh-only plan, so Terraform will not take any actions to undo these. If you were expecting these changes then you can apply this plan to record the updated values in the Terraform state without changing any remote objects.
However, Terraform also plans to revert the change:
❯ terraform plan
aws_instance.ec2-test: Refreshing state... [id=i-5297abcc6001ce9a8]

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  ~ update in-place

Terraform will perform the following actions:

  # aws_instance.ec2-test will be updated in-place
  ~ resource "aws_instance" "ec2-test" {
      - iam_instance_profile = "test" -> null
        id                   = "i-5297abcc6001ce9a8"
        tags                 = {}
        # (30 unchanged attributes hidden)
        # (7 unchanged blocks hidden)
    }

Plan: 0 to add, 1 to change, 0 to destroy.
I tried to figure out why these two changes don't produce the same effect. This article highlights differences depending on the argument default values: https://nedinthecloud.com/2021/12/23/terraform-apply-when-external-change-happens/
But the security_groups and iam_instance_profile arguments seem similar (optional with no default value), so why is Terraform handling these two cases differently?
(tested with Terraform v1.2.2, hashicorp/aws 4.21.0)
The handling of these situations unfortunately depends a lot on decisions made by the provider developer, since it's the provider's responsibility to decide how to reconcile any differences between the configuration and the prior state. (The "prior state" is what Terraform calls the state that results from running the "refresh" steps to synchronize with the remote system).
Terraform Core takes the values you've defined in the configuration (if any optional arguments are unset, Terraform Core uses null to represent that) and the values from the prior state and sends both of them to the provider to implement the planning step. The provider can then do whatever logic it wants as long as the planned new value for each attribute is consistent with the input. "Consistent" means that one of the following conditions is true:
1. The planned value is equal to the value set in the configuration. This is the most straightforward situation to follow, but there are various reasons why a provider might not do this, which I'll discuss later.
2. The planned value is equal to the value stored in the prior state. This represents situations where the value in the prior state is functionally equivalent to the value in the configuration but not exactly equal, such as if the remote system treats a particular string as case-insensitive and the two values differ only in case.
3. The provider indicated in its schema that this is a value that can be decided by the remote system, such as an object ID that's generated by the remote system during the apply step, and the corresponding value in the configuration was null to represent the argument not being set at all. In this case the provider gets to choose whichever value it wants, because the configuration says nothing about the attribute and thus the remote system has authority on what the value is.
From what you've described, it sounds like in your first example the provider used approach number 3, while in the second example the provider used approach number 1.
Since I am not the developer of this provider I cannot say for certain why the developers made the decisions they did here, but one common reason why a provider developer might choose option three is for situations where a particular value can potentially be set by multiple different resource types, in which case the provider might be designed to treat an absent argument in the configuration as meaning "keep whatever the remote system already has", whereas a non-null argument in the configuration would mean "set the remote system to use this given value".
For iam_instance_profile it seems like the provider considers null to be a valid configuration value for that argument and uses it to represent the EC2 instance having no associated instance profile at all. For vpc_security_group_ids and security_groups, though, if you leave the argument set to null in the configuration (or omit it, which is equivalent), the provider treats that as "keep whatever the remote system has", and so Terraform just acknowledges the change but doesn't propose to undo it.
Based on my knowledge of EC2, I can guess that the reason here is probably that the underlying EC2 API has two different ways to set security groups: you can either use the legacy EC2-Classic style of specifying a security group by name (the security_groups argument in the provider), or the newer EC2-VPC style of specifying it by ID (the vpc_security_group_ids argument in the provider). Whichever of the two you choose, the remote system will presumably populate the other one automatically, and therefore without this special exception in the provider it would be impossible for any configuration to converge unless you set both security_groups and vpc_security_group_ids and made them refer to the same security groups. To avoid that, I think the provider just lets whichever one of the two you left unset automatically track the remote system, which has the side-effect that the provider cannot automatically "fix" changes made outside of Terraform unless you set at least one of them so the provider can see what the correct value ought to be.
Terraform's ability to reconcile changes in the remote system by resetting back to match the configuration is a "best effort" mechanism because in many cases that requirement comes into conflict with other requirements, and provider developers must therefore decide on a case-by-case basis what to prioritize. Although Terraform does try its best to tell you about changes outside of Terraform and to propose fixing them where possible, the only certain way to keep your Terraform configuration and your remote system synchronized is to prevent anyone from making changes outside of Terraform, for example using IAM policies in AWS.
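To illustrate that last point with a sketch (reusing the security group ID from the plan output above; in practice you would reference an aws_security_group resource or data source rather than a hard-coded ID):

resource "aws_instance" "ec2-test" {
  ami           = "ami-0d71ea30463e0ff8d"
  instance_type = "t2.micro"

  # With this argument set explicitly, the provider has an authoritative
  # value to compare against, so a manual change to the instance's
  # security groups would be planned for reversal rather than just being
  # recorded in the state.
  vpc_security_group_ids = ["sg-8231be9a95a4b1886"]
}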

Terraform - why this is not causing circular dependency?

The Terraform Registry AWS VPC example terraform-aws-vpc/examples/complete-vpc/main.tf has the code below, which seems to me to be a circular dependency.
data "aws_security_group" "default" {
name = "default"
vpc_id = module.vpc.vpc_id
}
module "vpc" {
source = "../../"
name = "complete-example"
...
# VPC endpoint for SSM
enable_ssm_endpoint = true
ssm_endpoint_private_dns_enabled = true
ssm_endpoint_security_group_ids = [data.aws_security_group.default.id] # <-----
...
data.aws_security_group.default refers to "module.vpc.vpc_id" and module.vpc refers to "data.aws_security_group.default.id".
Please explain why this does not cause an error, and how module.vpc can refer to data.aws_security_group.default.id.
In the Terraform language, a module creates a separate namespace, but it is not a node in the dependency graph. Instead, each of the module's Input Variables and Output Values is a separate node in the dependency graph.
For that reason, this configuration contains the following dependencies:
The data.aws_security_group.default resource depends on module.vpc.vpc_id, which is specifically the output "vpc_id" block in that module, not the module as a whole.
The vpc module's variable "ssm_endpoint_security_group_ids" depends on the data.aws_security_group.default resource.
We can't see the inside of the vpc module in your question here, but the above is okay as long as there is no dependency connection between output "vpc_id" and variable "ssm_endpoint_security_group_ids" inside the module.
I'm assuming that such a connection does not exist, and so the evaluation order of objects here would be something like this:
1. aws_vpc.example in module.vpc is created (I just made up a name for this because it's not included in your question).
2. The output "vpc_id" in module.vpc is evaluated, referring to module.vpc.aws_vpc.example and producing module.vpc.vpc_id.
3. data.aws_security_group.default in the root module is read, using the value of module.vpc.vpc_id.
4. The variable "ssm_endpoint_security_group_ids" for module.vpc is evaluated, referring to data.aws_security_group.default.
5. aws_vpc_endpoint.example in module.vpc is created, including a reference to var.ssm_endpoint_security_group_ids.
Notice that in all of the above I'm talking about objects in modules, not modules themselves. The modules serve only to create separate namespaces for objects, and then the separate objects themselves (which includes individual variable and output blocks) are what participate in the dependency graph.
Normally this design detail isn't visible: Terraform just uses it to potentially optimize concurrency, by beginning work on part of a module before the whole module is ready to process. In some interesting cases like this, though, you can also intentionally exploit this design so that an operation for the calling module can be explicitly sandwiched between two operations for the child module.
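To make that concrete, the relevant parts of the vpc module's internals might look roughly like the following sketch (resource names and arguments are made up for illustration; the real module is considerably more involved):

# Inside the vpc module (simplified)
resource "aws_vpc" "this" {
  cidr_block = "10.0.0.0/16"
}

output "vpc_id" {
  # Depends only on the VPC itself, not on any input variable below.
  value = aws_vpc.this.id
}

variable "ssm_endpoint_security_group_ids" {
  type    = list(string)
  default = []
}

resource "aws_vpc_endpoint" "ssm" {
  vpc_id             = aws_vpc.this.id
  service_name       = "com.amazonaws.us-east-1.ssm"
  vpc_endpoint_type  = "Interface"
  security_group_ids = var.ssm_endpoint_security_group_ids
}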
Another reason why we might make use of this capability is when two modules naturally depend on one another, such as in an experimental module I built that hides some of the tricky details of setting up VPC peering connections:
locals {
  vpc_nets = {
    us-west-2 = module.vpc_usw2
    us-east-1 = module.vpc_use1
  }
}

module "peering_usw2" {
  source = "../../modules/peering-mesh"

  region_vpc_networks = local.vpc_nets
  other_region_connections = {
    us-east-1 = module.peering_use1.outgoing_connection_ids
  }

  providers = {
    aws = aws.usw2
  }
}

module "peering_use1" {
  source = "../../modules/peering-mesh"

  region_vpc_networks = local.vpc_nets
  other_region_connections = {
    us-west-2 = module.peering_usw2.outgoing_connection_ids
  }

  providers = {
    aws = aws.use1
  }
}
(the above is just a relevant snippet from an example in the module repository.)
In the above case, the peering-mesh module is carefully designed to allow this mutual referencing, internally deciding for each pair of regional VPCs which one will be the peering initiator and which one will be the peering accepter. The outgoing_connection_ids output refers only to the aws_vpc_peering_connection resource and the aws_vpc_peering_connection_accepter refers only to var.other_region_connections, and so the result is a bunch of concurrent operations to create aws_vpc_peering_connection resources, followed by a bunch of concurrent operations to create aws_vpc_peering_connection_accepter resources.
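A heavily simplified sketch of that internal shape (placeholder VPC IDs and a single hard-coded peering, whereas the real module computes which side initiates each pair):

variable "other_region_connections" {
  type    = map(string)
  default = {}
}

resource "aws_vpc_peering_connection" "outgoing" {
  vpc_id      = "vpc-11111111" # placeholder for the local VPC
  peer_vpc_id = "vpc-22222222" # placeholder for the remote VPC
  peer_region = "us-east-1"
}

output "outgoing_connection_ids" {
  # Refers only to the connections initiated from this region.
  value = { example = aws_vpc_peering_connection.outgoing.id }
}

resource "aws_vpc_peering_connection_accepter" "incoming" {
  # Refers only to IDs passed in from the other region's module
  # instance, so there is no cycle with the "outgoing" resource above.
  for_each = var.other_region_connections

  vpc_peering_connection_id = each.value
  auto_accept               = true
}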

In Terraform 0.12, how to skip creation of a resource if one with the same name already exists?

I am using Terraform version 0.12. I have a requirement to skip resource creation if a resource with the same name already exists.
I did the following for this:
Read the list of custom images:
data "ibm_is_images" "custom_images" {
}
Check if the image already exists:
locals {
  custom_vsi_image = contains([for x in data.ibm_is_images.custom_images.images : "true" if x.visibility == "private" && x.name == var.vnf_vpc_image_name], "true")
}

output "abc" {
  value = "${local.custom_vsi_image}"
}
Create the image only if it does not already exist:
resource "ibm_is_image" "custom_image" {
count = "${local.custom_vsi_image == true ? 0 : 1}"
depends_on = ["data.ibm_is_images.custom_images"]
href = "${local.image_url}"
name = "${var.vnf_vpc_image_name}"
operating_system = "centos-7-amd64"
timeouts {
create = "30m"
delete = "10m"
}
}
This works fine the first time with "terraform apply": it finds that the image does not exist, so it creates it.
When I run "terraform apply" a second time, it deletes the "custom_image" resource that was created above. Any idea why it deletes the resource on the second run?
Also, how can I create a resource based on some condition (like only when it does not already exist)?
In Terraform, you're required to decide explicitly what system is responsible for the management of a particular object, and conversely which systems are just consuming an existing object. There is no way to make that decision dynamically, because that would make the result non-deterministic and -- for objects managed by Terraform -- make it unclear which configuration's terraform destroy would destroy the object.
Indeed, that non-determinism is why you're seeing Terraform in your situation flop between trying to create and then trying to delete the resource: you've told Terraform to only manage that object if it doesn't already exist, and so the first time you run Terraform after the object exists, Terraform sees that it is no longer supposed to be managed and plans to destroy it.
If your goal is to manage everything with Terraform, an important design task is to decide how object dependencies flow within and between Terraform configurations. In your case, it seems like there is a producer/consumer relationship between a system that manages images (which may or may not be a Terraform configuration) and one or more Terraform configurations that consume existing images.
If the images are managed by Terraform then that suggests either that your main Terraform configuration should assume the image does not exist and unconditionally create it -- if your decision is that the image is owned by the same system as what consumes it -- or it should assume that the image does already exist and retrieve the information about it using a data block.
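For example, if you decide that the image is always managed elsewhere, the consuming configuration could look it up with a data block instead of the count trick (a sketch, assuming the IBM provider's ibm_is_image data source and your existing variable name):

data "ibm_is_image" "custom_image" {
  name = var.vnf_vpc_image_name
}

# Consumers then refer to data.ibm_is_image.custom_image.id and no longer
# need to decide whether to create the image themselves.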
A possible solution here is to write a separate Terraform configuration that manages the image and then only apply that configuration in situations where that object isn't expected to already exist. Then your configuration that consumes the existing image can just assume it exists without caring about whether it was created by the other Terraform configuration or not.
There's a longer overview of this situation in the Terraform documentation section Module Composition, and in particular the sub-section Conditional Creation of Objects. That guide is focused on interactions between modules in a single configuration, but the same underlying principles apply to dependencies between configurations (via data sources) too.

Passing an attribute from a resource with count [0 or 1] defined to a module - possible?

Terraform 0.12.13, azurerm provider 1.35
Some background: I have a set of Azure App Services, hosted on an App Service Plan, in a Resource Group, in an Azure location. I now need to duplicate this stack in a different Azure location and add some additional resources like Traffic Managers and CNAMEs and whatnot in order to implement high availability. Architecturally we have Primary resources, and then a smaller subset of Secondary resources in the secondary region (not everything needs to be duplicated). Not every deployment will require high availability, so I need to be able to instantiate or not instantiate the Secondaries at run-time.
Because I was trying to be a good software engineer, I created modules to instantiate most of this stuff - one for the app services, one for the app service plan, one for the traffic managers, and so on.
The problem I have now is that I'm using the old count + ternary operator trick to control whether the secondary resources get created, and this is breaking because 1) count isn't allowed as a module meta-argument yet and 2) I can't figure out how to pass exported attributes from a resource controlled by the count meta-argument to a module as an input variable.
The following code may make this clearer.
resource "azurerm_resource_group" "appservices_secondary" {
name = "foo-services-ca-${local.secondary_release_stage_name}-${var.pipeline}-rg"
location = local.secondary_location
count = var.enable_high_availability ? 1 : 0
}
# Create the app service plan to host the secondary app services
module "plan_secondary" {
source = "./app_service_plan"
release_stage_name = local.secondary_release_stage_name
# HERE'S THE PROBLEMATIC LINE
appsvc_resource_group_name = azurerm_resource_group.appservices_secondary[0].name
location = local.secondary_location
pipeline = var.pipeline
}
If count resolves to 1 (var.enable_high_availability = true) then everything's fine.
If count resolves to 0 (var.enable_high_availability = false) then terraform plan fails:
Error: Invalid index

  on .terraform\modules\services\secondary.tf line 25, in module "plan_secondary":
  25: appsvc_resource_group_name = azurerm_resource_group.appservices_secondary[0].name
    |----------------
    | azurerm_resource_group.appservices_secondary is empty tuple

The given key does not identify an element in this collection value.
If I change the input variable value to azurerm_resource_group.appservices_secondary.name then it won't pass terraform validate because it recognizes that it needs [count.index].
Is there a simple way to resolve this? I'm increasingly thinking this is a design problem and I should have built the modules with count = [1..2] rather than count = 1 (primary) and count = [0 || 1] (secondary) but that will require me to rewrite all the modules and I'd like to avoid that if there's some clever workaround.
In order to resolve this you can use a conditional expression for appsvc_resource_group_name to provide some alternative value to use when the azurerm_resource_group.appservices_secondary resource has count = 0:
appsvc_resource_group_name = length(azurerm_resource_group.appservices_secondary) > 0 ? azurerm_resource_group.appservices_secondary[0].name : "default-value"
It looks like this other module is not useful in situations where high availability is disabled. In that case, you might want to define the variable as being optional with a default of null so that you can recognize when it isn't set in the module:
variable "appsvc_resource_group_name" {
type = string
default = null
}
Elsewhere in the configuration you can test var.appsvc_resource_group_name != null to see if it's enabled.
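For example, inside the module you might guard the secondary-only resources on that variable (a sketch; the resource and its arguments are illustrative, not your actual module code):

resource "azurerm_app_service_plan" "secondary" {
  # Create the plan only when a secondary resource group name was given.
  count = var.appsvc_resource_group_name != null ? 1 : 0

  name                = "example-secondary-plan"
  location            = "canadacentral"
  resource_group_name = var.appsvc_resource_group_name

  sku {
    tier = "Standard"
    size = "S1"
  }
}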
When following the module composition patterns I'd likely instead build this as two modules, using one of the following two strategies:
One module for building a "normal" (non-HA) stack and another module for building a HA stack, and then choose which one to use in the root module of each configuration depending on whether a particular configuration needs the normal or HA mode.
Alternatively, if the HA stack is always a superset of the "normal" stack, have one module for the normal stack, and then another module that consumes the outputs of the first and describes the extended resources needed for HA mode.
Here's an example of the second of those approaches, just to illustrate what I mean by it:
module "primary_example" {
source = "./primary_example"
# whatever arguments are needed
}
module "secondary_example" {
source = "./secondary_example"
# Make sure the primary module exports as outputs all of the
# values required to extend to HA mode, and then just pass
# that whole object through to secondary.
primary = module.primary_example
}
In a configuration that doesn't need HA mode you can then omit module "secondary_example".
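Inside the secondary module, the receiving side of that pattern is just a variable of type any that holds the whole object of primary outputs, which the HA-only resources can then refer to. A sketch (assuming the primary module exports an output named resource_group_name; the traffic manager arguments are illustrative):

# Inside ./secondary_example
variable "primary" {
  type = any
}

resource "azurerm_traffic_manager_profile" "example" {
  name                   = "example-tm"
  resource_group_name    = var.primary.resource_group_name
  traffic_routing_method = "Priority"

  dns_config {
    relative_name = "example-tm"
    ttl           = 30
  }

  monitor_config {
    protocol = "HTTPS"
    port     = 443
    path     = "/"
  }
}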
The module composition patterns are about decomposing the configuration into small pieces that describe one self-contained capability and then letting the root module select from those capabilities whatever subset of them are relevant and connecting them in a suitable way.
In this case, I'm treating non-HA infrastructure as one capability and then HA extensions to that infrastructure as a second capability that depends on the first, connecting them together in a dependency inversion style so that the HA extensions can just assume that a non-HA deployment already exists and that information about it will be passed in by its caller.

Setting the value of a Terraform variable in a tfvars file for a nested structure

Terraform has adjusted its authorization; in main.tf (for the SQL config) I now have:
resource "google_sql_database_instance" "master" {
name = "${random_id.id.hex}-master"
region = "${var.region}"
database_version = "POSTGRES_9_6"
# allow direct access from work machines
ip_configuration {
authorized_networks = "${var.authorized_networks}"
require_ssl = "${var.sql_require_ssl}"
ipv4_enabled = true
}
}
where, in variables.tf, I have:
variable "authorized_networks" {
description = "The networks that can connect to cloudsql"
type = "list"
default = [
{
name = "work"
value = "xxx.xxx.xx.xxx/32"
}
]
}
where xxx.xxx.xx.xxx is the IP address I would like to allow. However, I prefer not to put this in my variables.tf file, but rather in a non-source-controlled .tfvars file.
For variables that have a simple value this is easy, but it is not clear to me how to do it for the nested structure. Replacing xxx.xxx.xx.xxx with a variable (e.g. var.work_ip) leads to an error:
variables may not be used here
Any insights?
If you omit the default argument in your main configuration altogether, you will mark variable "authorized_networks" as a required input variable, which Terraform will then check to ensure that it is set by the caller.
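Concretely, the declaration would keep the type and description but drop the default (based on the block shown in the question):

variable "authorized_networks" {
  description = "The networks that can connect to cloudsql"
  type        = "list"
}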
If this is a root module variable, then you can provide the value for it in a .tfvars file using the following syntax:
authorized_networks = [
  {
    name  = "work"
    value = "xxx.xxx.xx.xxx/32"
  },
]
If this file is being generated programmatically by some wrapping automation around Terraform, you can also write it into a .tfvars.json file and use JSON syntax, which is often easier to construct robustly in other languages:
{
  "authorized_networks": [
    {
      "name": "work",
      "value": "xxx.xxx.xx.xxx/32"
    }
  ]
}
You can either specify this file explicitly on the command line using the -var-file option, or you can give it a name ending in .auto.tfvars or .auto.tfvars.json in the current working directory when you run Terraform and Terraform will then find and load it automatically.
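For example, assuming the values are saved in a file named secret.tfvars (the file name is arbitrary):
❯ terraform apply -var-file="secret.tfvars"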
A common reason to keep something out of version control is because it's a dynamic setting configured elsewhere in the broader system rather than a value fixed in version control. If that is true here, then an alternative strategy is to save that setting in a configuration data store that Terraform is able to access via data sources and then write your Terraform configuration to retrieve that setting directly from the place where it is published.
For example, if the network you are modelling here were a Google Cloud Platform subnetwork, and it has either a fixed name or one that can be derived systematically in Terraform, you could retrieve this setting using the google_compute_subnetwork data source:
data "google_compute_subnetwork" "work" {
name = "work"
}
Elsewhere in configuration, you can then use data.google_compute_subnetwork.work.ip_cidr_range to access the CIDR block definition for this network.
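Putting that together, the authorized network entry could then be built directly from the data source instead of a variable, mirroring the structure used in the question's configuration (a sketch):

authorized_networks = [
  {
    name  = "work"
    value = "${data.google_compute_subnetwork.work.ip_cidr_range}"
  },
]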
The major Terraform providers have a wide variety of data sources like this, including ones that retrieve specific first-class objects from the target platform and also more generic ones that access configuration stores like AWS Systems Manager Parameter Store or HashiCorp Consul. Accessing the necessary information directly or publishing it "online" in a configuration store can be helpful in a larger system to efficiently integrate subsystems.
