Resolving broken deleted state in Terraform

When Terraform tries to deploy something and then times out in a state like pending or deleting, the resource eventually moves to successful or deleted on the provider's side, but this never gets reflected in the Terraform state. So when I try to run something again, it errors because the state doesn't match:
Error: error waiting for EC2 Transit Gateway VPC Attachment (tgw-attach-xxxxxxxxx) deletion: unexpected state 'failed', wanted target 'deleted'. last error: %!s(<nil>)
What is the correct way to handle this? Can I do something within terraform to get it to recognise the latest state in AWS? Is it a bug on tf's part?

tl;dr
It's probably less of a bug and more of a design choice.
You should investigate and, if appropriate (e.g. the resource was actually created or deleted successfully but the state was not updated to match), you can either:
run terraform refresh, which will cause Terraform to refresh its state file against what actually exists at the cloud provider, or
manually reconcile the situation by manipulating the Terraform state with the terraform state command, removing deleted resources or adding created resources.
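As a sketch, the two options look like this (the resource address and ID below are hypothetical, not taken from the question):

```shell
# Option 1: reconcile the state file against real infrastructure
terraform refresh

# Option 2: surgically edit the state
terraform state list                                                # find the stale entry
terraform state rm aws_ec2_transit_gateway_vpc_attachment.example   # drop a deleted resource
terraform import aws_ec2_transit_gateway_vpc_attachment.example tgw-attach-0123456789abcdef0
```

terraform import is only needed when the resource exists remotely but is missing from state; terraform state rm covers the opposite case.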
Detail
Unlike CloudFormation, Terraform's approach to 'failures' is to drop everything and error out, leaving the operator to investigate the issue and resolve it themselves. As a result, operations which time out are classed as failures, and so the relevant resources are often not updated in Terraform's state.
Terraform does give us some recourse to handle this however. For one, we can manually manipulate Terraform's state file. We can add resources or remove resources from the state file as we like, though this should be done with caution.
We can also ask Terraform to 'refresh' its state, basically comparing the state file to reality. Implicitly this should remove resources which no longer exist, but it will not adopt resources into the state file which were provisioned outside of a successful Terraform run.
As an aside, timeouts relating to the interaction with any service provider are a feature of the relevant Terraform provider, in this case the AWS provider. Only the providers can expose configurable timeouts. For example, the AzureRM provider does provide a means to configure timeouts, but it appears the AWS provider does not.
Efforts are presumably made to incorporate sensible timeout values, but it's not unusual to see trivial operations take an age to complete properly.
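Where a provider does support it, timeouts are configured per resource in a nested block; a sketch using the AzureRM provider (the resource, names, and durations here are illustrative, not a recommendation):

```hcl
resource "azurerm_virtual_network" "example" {
  name                = "example-network"
  location            = "westeurope"
  resource_group_name = "example-rg"
  address_space       = ["10.0.0.0/16"]

  # Override the provider's default operation timeouts
  timeouts {
    create = "30m"
    delete = "60m"
  }
}
```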

Related

What does terraform apply/plan refresh-only do?

So I'm a bit confused about what terraform plan -refresh-only is giving me. Essentially, with just terraform plan, it said it detected changes outside of Terraform (that was me) and it was trying to "correct" these changes; sadly, correcting these changes requires recreating the resource. However, if I add "-refresh-only" to the plan, it drops that recreation and instead says it will update the tfstate to match the changes I made manually.
Is my understanding of this correct or are there things I'm missing?
A "normal" terraform plan includes two main behaviors:
Updating the state from the previous run to reflect any changes made outside of Terraform. That's called "refreshing" in Terraform terminology.
Comparing that updated state with the desired state described by the configuration and, in case of any differences, generating a proposed set of actions to change the real remote objects to match the desired state.
When you create a "refresh-only" plan, you're disabling the second of those, but still performing the first. Terraform will update the state to match changes made outside of Terraform, and then ask you if you want to commit that result as a new state snapshot to use on future runs. Typically the desired result of a refresh-only plan is for Terraform to report that there were no changes outside of Terraform, although Terraform does allow you to commit the result as a new state snapshot if you wish, for example if the changes cascaded from an updated object used as a data resource and you want to save those new results.
A refresh-only plan prevents Terraform from proposing any actions that would change the real infrastructure for that particular plan, but it does not avoid the need to deal with any differences in future plans. If the changes that Terraform is proposing are not acceptable then to move forward you will either need to change the configuration to match your actual desired state (for example, to match the current state of the object you don't want to replace) or change the real infrastructure (outside of Terraform) so it will match your configuration.
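The two-step workflow described above can be sketched as:

```shell
# Show what Terraform would record as drift, without proposing any
# changes to real infrastructure
terraform plan -refresh-only

# Accept the detected drift and commit it as a new state snapshot
terraform apply -refresh-only
```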

Terraform plan: Saved plan is stale

How do I force Terraform to rebuild its plans and tfstate files from scratch?
I'm considering moving my IAC from GCP's Deployment Manager to Terraform, so I thought I'd run a test, since my TF is pretttty rusty. In my first pass, I successfully deployed a network, subnet, firewall rule, and Compute instance. But it was all in a single file and didn't scale well for multiple environments.
I decided to break it out into modules (network and compute), and I was done with the experiment for the day, so I tore everything down with a terraform destroy
So today I refactored everything into its modules, and accidentally copypasta-ed the network resource from the network module to the compute module. Ran a terraform plan, and then a terraform apply, and it complained about the network already existing.
And I thought that it was because I had somehow neglected to tear down the network I'd created the night before? So I popped over to the GCP console, and yeah, it was there, so...I deleted it. In the UI. Sigh. I'm my own chaos engineer.
Anyway, somewhere right around there, I discovered my duplicate resource and removed it, realizing that the aforementioned complaint about the "network resource already existing" was coming from the 2nd module to run.
And I ran a terraform plan again, and it didn't complain about anything, so I ran a terraform apply, and that's when I got the "stale plan" error. I've tried the only thing I could think of - terraform destroy, terraform refresh - and then would try a plan and apply after that,
I could just start fresh from a new directory and new names on the tfstate/tfplan files, but it bothers me that I can't seem to reconcile this "stale plan" error. Three questions:
Uggh...what did I do wrong? Besides trying to write good code after a 2-hour meeting?
Right now this is just goofing around, so who cares if everything gets nuked? I'm happy to lose all created resources. What are my options in this case?
If I end up going to prod with this, obviously idempotence is a priority here, so what are my options then, if I need to perform some disaster recovery? (Ultimately, I would be using remote state to make sure we've got the tfstate file in a safe place.)
I'm on Terraform 0.14.1, if that matters.
"Saved plan is stale" means out of date: your plan no longer matches the current state of your infrastructure.
Either the infrastructure was changed outside of Terraform, or the state was changed after the plan was saved (for example by another terraform apply), so the saved plan file no longer applies cleanly.
Way 1: to fix that, run terraform plan with the -out flag to save a fresh plan, then apply that saved plan before the state changes again.
Way 2: more simply, run terraform refresh and after that a plain terraform apply.
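A sketch of Way 1 (the plan filename is arbitrary):

```shell
# Save a fresh plan, then apply exactly that plan.
# Any state change between these two commands makes the plan stale again.
terraform plan -out=tfplan
terraform apply tfplan
```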
I created the infrastructure via the gcloud CLI first for testing purposes. As soon as it was proven as working, I transferred the configuration to gitlab and encountered the same issue in one of my jobs. The issue disappeared after I changed the network's and cluster's names.

What happens when removing resources from a Terraform configuration file

I am trying to find my way in Terraform. I am going through the documentation and got confused about what exactly will happen if a resource is deleted manually from the configuration file and we then run the apply command on the modified configuration file, please?
My understanding is that the state file will still have the deleted resource, and it will still actually be running on the cloud platform, so terraform apply will not perform any action, but I am not sure.
Appreciate it if you can help clear up my understanding, please.
Also, another relevant point please: what if a resource was changed manually, from the cloud console for example, and we then tried to perform some action from Terraform on that resource, what would happen?
Thanks a lot.
First, some background from the documentation at https://www.terraform.io/intro/index.html
Terraform generates an execution plan describing what it will do to reach the desired state, and then executes it to build the described infrastructure. As the configuration changes, Terraform is able to determine what changed and create incremental execution plans which can be applied.
The state mentioned gets saved when resources are modified. When you add a resource, it will get created and the state will be updated to reflect this. The same goes for removing resources. When you remove a resource in Terraform by deleting the code or template file, the resource will be removed and the state updated to reflect the removed resource (emphasis mine to illustrate the answer).
The second question, about resources that drift from the state, is a little more involved. When you create a plan against a resource that may have changed, the provider will usually refresh the resource in state to compare them and show you what changes will be made (i.e. trying to change the resource back to the declared state in the code).
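To make the first answer concrete: if a resource block like the one below is deleted from the configuration, the next apply destroys the real object too (the resource and bucket name here are illustrative):

```hcl
# Present in both the configuration and the state
resource "aws_s3_bucket" "logs" {
  bucket = "example-log-bucket"
}

# After deleting the block above, `terraform plan` proposes:
#   aws_s3_bucket.logs will be destroyed
# and `terraform apply` deletes the bucket and removes it from state.
```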

Get Terraform to Always update a resource, even if there are no changes

I have a Terraform provider that works against an API that can "lose" API objects out of band. I've talked to the vendor and they said it's in their backlog, but they don't seem to have any interest in actually fixing it.
The Terraform provider also does not use data sources, only remote state, so it cannot actually detect that the field is missing.
If I force Terraform to update the resource by changing the state, the field returns, but it's hacky. Is there something like "ignore_changes" that always updates a resource?
One of the Terraform provider API operations for managed resource instances is ReadResource, which in the Terraform SDK maps to the Read callback in schema.Resource.
The purpose of this function is to use the information that was recorded in the Terraform state at the end of the last operation to produce a new object that describes the current state of the object in the remote system. As well as detecting "drift" for attributes of the object itself, this function can also potentially detect and report that the remote object is no longer present.
In the current SDK API, a Read implementation can report that the object no longer exists by calling d.SetId with an empty string as the ID, because the SDK requires that any valid object have a non-empty ID.
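The pattern can be sketched without depending on the real SDK; here resourceData is a local stand-in for the SDK's *schema.ResourceData, and fetchRemoteObject is a hypothetical API call that reports whether the remote object still exists:

```go
package main

import "fmt"

// resourceData is a stand-in for the SDK's *schema.ResourceData; the real
// type lives in github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema.
type resourceData struct{ id string }

func (d *resourceData) Id() string      { return d.id }
func (d *resourceData) SetId(id string) { d.id = id }

// fetchRemoteObject is a hypothetical API lookup; it returns false when the
// object has been deleted out of band.
func fetchRemoteObject(id string) bool {
	return id == "tgw-attach-live"
}

// resourceRead mirrors the shape of an SDK Read callback: when the remote
// object is gone, setting an empty ID tells Terraform to drop it from state
// and plan its re-creation on the next apply.
func resourceRead(d *resourceData) error {
	if !fetchRemoteObject(d.Id()) {
		d.SetId("") // object vanished remotely: mark it as gone
		return nil
	}
	// ...otherwise copy the remote attributes into state here...
	return nil
}

func main() {
	d := &resourceData{id: "tgw-attach-gone"}
	resourceRead(d)
	fmt.Println(d.Id() == "") // an empty ID signals "no longer exists"
}
```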
Terraform calls ReadResource as part of preparing a plan in terraform plan, or in the implied planning step of a no-arguments terraform apply. Therefore, if the provider is able to detect and signal that the object no longer exists, or that any of its attributes have changed since the conclusion of the last operation, Terraform will take that drift into account when comparing with the configuration to produce the proposed plan.
Terraform's workflow generally expects that all resources will converge on a stable state so that users can run terraform plan and see the message that there are no changes pending. A situation where a provider perpetually proposes more changes on every plan is usually considered to be a bug in the provider, and so I wouldn't suggest that as a solution to your problem and Terraform has no features intended to provide such a capability.

How do I cause terraform to skip destroying resources?

I am using terraform to provision an Azure AKS Kubernetes cluster, including a bunch of namespaces, deployments (e.g., cert-manager, external-dns, etc), secrets, and so on. These all get deleted when the cluster is torn down, but some of them cannot be deleted by terraform. This happens most often with namespaces, like the following (it never actually finishes removing all content):
"Operation cannot be fulfilled on namespaces "cert-manager": The system is ensuring all content is removed from this namespace. Upon completion, this namespace will automatically be purged by the system."
How do I cause terraform to ignore these resources when destroying?
On the surface, this seems like a big ask of Terraform.
Terraform manages state, so it knows what it created, and what resources depend on each other. When it destroys something, it knows what dependencies to destroy as well, and this sets up an ordering of operations.
So it seems you're saying you want Terraform to control the creation, but to "forget" to destroy some things, despite it keeping a map of dependencies. This seems like a good way to get a corrupt state.
So with that caveat in mind, perhaps you could try "terraform state rm" judiciously, so that terraform isn't managing the things that need to be skipped when destroying things.
Something like
terraform apply
some script that picks holes in the state with "terraform state rm"
terraform destroy
The hard part is making sure that everything that remains does not reference anything that has been "rm'd"; Terraform will get mad at you and probably refuse to do it.
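A sketch of that workflow (the resource addresses are hypothetical):

```shell
terraform apply

# "Forget" the resources that must survive a destroy. This does not delete
# them; it only removes them from Terraform's state.
terraform state rm kubernetes_namespace.cert_manager
terraform state rm kubernetes_secret.cert_manager_ca

# Destroy everything Terraform still knows about
terraform destroy
```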
