How to detect changes made outside of Terraform?

I have been using Terraform for some months now, and I have reached the point where all of my infrastructure is based on Terraform files, giving me much better control of the resources in our multiple accounts.
But I have a big problem. If someone makes a "manual" alteration to any Terraformed resource, it is easy to detect the change.
But what happens if a resource was not created using Terraform in the first place? I just don't know how to track new resources, or changes to them, if those resources were not created using Terraform.

A key design tradeoff for Terraform is that it will only attempt to manage objects that it created or that you explicitly imported into it, because Terraform is often used in mixed environments where either some objects are managed by other software (like an application deployment tool) or the Terraform descriptions are decomposed into multiple separate configurations designed to work together.
For this reason, Terraform itself cannot help with the problem of objects created outside of Terraform. You will need to solve this using other techniques, such as access policies that prevent creating objects directly, or separate software (possibly created in-house) that periodically scans your cloud vendor accounts for objects that are not present in the expected Terraform state snapshot(s).
Access policies are typically the more straightforward path to implement, because preventing objects from being created in the first place is easier than recognizing objects that already exist, particularly if you are working with cloud services that create downstream objects as a side-effect of their work, as we see with (for example) autoscaling controllers.

Martin's answer is excellent and explains that Terraform can't be the arbiter of this, as it is designed to play nicely both with other tooling and with itself (i.e. across different state files).
He also mentioned that access policies (although these have to be cloud/provider specific) are a good alternative, so this answer will instead provide some options for handling this in AWS if you do want to enforce it.
The AWS SDKs and other clients, including Terraform, all send a user agent header with every request. This is recorded by CloudTrail, so you can search through CloudTrail logs with your favourite log searching tools to look for API actions that should be done via Terraform but don't use Terraform's user agent.
The other option that uses the user agent request header is IAM's aws:UserAgent global condition key, which will block any requests whose user agent header doesn't match the one defined. An example IAM policy might look like this:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1598919227338",
      "Action": [
        "dlm:GetLifecyclePolicies",
        "dlm:GetLifecyclePolicy",
        "dlm:ListTagsForResource"
      ],
      "Effect": "Allow",
      "Resource": "*"
    },
    {
      "Sid": "Stmt1598919387700",
      "Action": [
        "dlm:CreateLifecyclePolicy",
        "dlm:DeleteLifecyclePolicy",
        "dlm:TagResource",
        "dlm:UntagResource",
        "dlm:UpdateLifecyclePolicy"
      ],
      "Effect": "Allow",
      "Resource": "*",
      "Condition": {
        "StringLike": {
          "aws:UserAgent": "*terraform*"
        }
      }
    }
  ]
}
The above policy allows the user, group or role it is attached to to perform read-only actions on any DLM resource in the AWS account. It then allows any client with a user agent header containing the string terraform to perform actions that create, update or delete DLM resources. If a client doesn't have terraform in its user agent header, any request to modify a DLM resource will be denied.
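If you manage IAM with Terraform itself, the same guard policy could be expressed as a Terraform resource. A minimal sketch (the policy name is hypothetical, and you would still need to attach it to the relevant users, groups or roles):
resource "aws_iam_policy" "dlm_terraform_only" {
  # Hypothetical name; attach this policy to the users/groups/roles it should constrain.
  name = "dlm-terraform-only"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid      = "AllowDlmReads"
        Effect   = "Allow"
        Resource = "*"
        Action = [
          "dlm:GetLifecyclePolicies",
          "dlm:GetLifecyclePolicy",
          "dlm:ListTagsForResource",
        ]
      },
      {
        Sid      = "AllowDlmWritesOnlyViaTerraform"
        Effect   = "Allow"
        Resource = "*"
        Action = [
          "dlm:CreateLifecyclePolicy",
          "dlm:DeleteLifecyclePolicy",
          "dlm:TagResource",
          "dlm:UntagResource",
          "dlm:UpdateLifecyclePolicy",
        ]
        Condition = {
          StringLike = {
            "aws:UserAgent" = "*terraform*"
          }
        }
      },
    ]
  })
}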
Caution: It's worth noting that clients can override the user agent string, so this shouldn't be relied on as a foolproof way of preventing access outside of Terraform. The above-mentioned techniques are mostly useful for getting an idea of the usage of other tools (e.g. the AWS Console) in accounts where you would prefer changes to be made by Terraform only.
The AWS documentation to the IAM global condition keys has this to say:
Warning
This key should be used carefully. Since the aws:UserAgent value is provided by the caller in an HTTP header, unauthorized parties can use modified or custom browsers to provide any aws:UserAgent value that they choose. As a result, aws:UserAgent should not be used to prevent unauthorized parties from making direct AWS requests. You can use it to allow only specific client applications, and only after testing your policy.
The Python SDK, boto, covers how the user agent string can be modified in its configuration documentation.

I haven't put this into practice, but my idea has always been that this should be possible with consistent usage of tags. A first, naive
provider "aws" {
default_tags {
tags = {
Terraform = "true"
}
}
}
should be sufficient in many cases.
If you fear rogue developers will add this tag manually so as to hide their hacks, you could complicate your Terraform modules to rotate the tag value over time to unpredictable values, so you could still search for inappropriately tagged resources. Hopefully the burden of defeating such a mechanism would exceed the effort of simply terraforming a project. (Not for you.)
On the downside, many resources will legitimately not be terraformed, e.g. DynamoDB tables or S3 objects. A watching process would somehow need to whitelist what is allowed to exist; certainly not compute resources, though.
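For example, a rough sketch of such a watcher, written in Terraform itself against the Resource Groups Tagging API data source (assuming that API covers the services you care about), could simply report ARNs that are missing the tag:
data "aws_resourcegroupstaggingapi_resources" "all" {}

output "resources_missing_terraform_tag" {
  # ARNs visible to the Resource Groups Tagging API that do not carry the Terraform tag.
  value = [
    for mapping in data.aws_resourcegroupstaggingapi_resources.all.resource_tag_mapping_list :
    mapping.resource_arn
    if !contains(keys(mapping.tags), "Terraform")
  ]
}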
Tuning access policies and the usage of CloudTrail, as @ydaetskcoR suggests, might be unsuitable for assessing the extent of unterraformed legacy infrastructure, but they are definitely worth the effort anyway.
This Reddit thread https://old.reddit.com/r/devops/comments/9rev5f/how_do_i_diff_whats_in_terraform_vs_whats_in_aws/ discusses this very topic, with some attention gathered around the sadly archived https://github.com/dtan4/terraforming , although IMHO it feels like too much.

Related

Azure Machine Learning Computes - Template properties - Required properties for attach operation

As described in https://learn.microsoft.com/en-us/azure/templates/microsoft.machinelearningservices/workspaces/computes?tabs=bicep there are the properties location, sku, tags and identity.
For me it is not clear whether these properties relate to the parent workspace or to the compute resource (e.g. there is also computeLocation, and sku, as far as I can see, should have the same value as the workspace)...
It would be great if someone could clarify to which resource these properties and their values belong (workspace vs. compute resource).
EDIT:
Also: which properties are actually required for attach versus create? E.g. do I need identity or computeLocation for attach, and if so, what is their purpose, given that the compute resource is created in another context?
I also figured out that location as well as disableLocalAuth are required for the attach operation - why, when the resource is deployed in another context and only attached?
And why do I get "unsupported compute type" when checking the compute resources of the attached AKS via the Azure CLI?
{
  "description": "Default AKS Cluster",
  "id": "/subscriptions/xxx/resourceGroups/xxx/providers/Microsoft.MachineLearningServices/workspaces/xxx/computes/DefaultAKS",
  "location": "westeurope",
  "name": "DefaultAKS",
  "provisioning_state": "Succeeded",
  "resourceGroup": "xxx",
  "resource_id": "/subscriptions/xxx/resourcegroups/xxx/providers/Microsoft.ContainerService/managedClusters/xxx",
  "type": "*** unsupported compute type ***"
}
EDIT-2:
So based on the response from @SairamTadepalli-MT, all the properties actually relate to the compute resource - which makes sense. Still, I don't understand the purpose of a few of these properties. For instance, why is there both a "location" and a "computeLocation", and what is the meaning of "sku" (e.g. I tried "AmlCompute" and provided the value "Basic" - but "Basic" is the "sku" of the workspace, and for "AmlCompute" the size is actually defined by "vmSize", isn't it?)...
Which brings me to the next point: the documentation currently lacks a detailed description of which properties need to be provided, with which values, in which scenarios (besides "properties").
This is also true for attach (i.e. providing a "resourceId") vs. create (i.e. providing "properties"): which properties are actually required for attach? From what I figured out, it requires "location" and "disableLocalAuth" - why do I need these properties, when I would assume "name" and "resourceId" (and maybe "computeType") should be sufficient to attach a compute resource? What is the purpose of properties like "sku", "tags" or "identity" when I attach an existing compute resource?
Finally, regarding "unsupported compute type": I'm not sure your response really helps me. The AKS is successfully attached, so I don't understand why I get "unsupported compute type". This should be fixed.
Before identifying the properties and their uses, we need to understand a small difference between workspaces and compute resources.
Workspaces: The workspace is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning. The workspace keeps a history of all training runs, including logs, metrics, output, and a snapshot of your scripts. You use this information to determine which training run produces the best model.
https://learn.microsoft.com/en-us/azure/machine-learning/concept-workspace
On the other hand, let's look at compute resources:
Compute Resources: An Azure Machine Learning compute instance is a managed cloud-based workstation for data scientists. Compute instances make it easy to get started with Azure Machine Learning development as well as provide management and enterprise readiness capabilities for IT administrators.
https://learn.microsoft.com/en-us/azure/machine-learning/concept-compute-instance
Workspaces are used to work with machine learning resources and to keep a log of all operations, whether completed, in progress or failed.
Compute resources, in contrast, are the administrative resources that hold the actual authority over the compute that was created and shared.
Moving on to the next part of the question, regarding the properties:
location: Identifies where the compute resource is created for ML operations. Belongs to the compute resource.
sku: The tier, and the price of that tier, chosen in the subscription. This relates to the compute resource.
tags: Every created resource needs to be recognizable; tags are identifiers on the compute resource.
identity: The identity/subscription details of the compute resource. This can further be used with the workspace that deploys the ML application.
computeLocation: The location where the underlying compute itself is located, which defines where the output is stored.
disableLocalAuth: Azure provides Microsoft Azure Active Directory (Azure AD) authentication support for all Automation service public endpoints. This critical security enhancement removes certificate dependencies and gives organizations control to disable local authentication methods. This feature provides you with seamless integration when centralized control and management of identities and resource credentials through Azure AD is required.
unsupported compute type: When working with Azure Kubernetes Service (AKS) clusters, you might occasionally come across problems. Refer to the link below for troubleshooting:
https://learn.microsoft.com/en-us/azure/aks/troubleshooting

Prevent resources from being deleted via Terraform/ Azure Console in a large team

Current situation: We are at the beginning of a migration to the Azure Cloud. It is an enterprise project with many services. There are several people on the team who have little experience with Terraform or Azure.
Goals:
In the best case, all resources of the production Azure subscription are managed with Terraform, so that changes can be easily tracked and a new empty subscription (e.g. a test subscription) can quickly be brought up to the same level - all with as few manual steps as possible.
In my absence, changes can also be made by less experienced team members, but accidental deletion of certain resources should be prevented. This should also apply to me, since I have not been doing this for very long and mistakes can always happen. I would like to put several hurdles in place to avoid accidental deletion of some resources, both in Terraform and in the Azure GUI.
What I have tried:
1.
lifecycle {
  prevent_destroy = true
}
This prevents deletion via terraform destroy, but not, for example, deletion of the resource if someone removes it from the .tf file (or deletes the entire .tf file), then runs terraform apply and overlooks the deletion. It also does not prevent deletion in the Azure GUI.
2. Using azurerm_management_lock
E.g. resource1.tf
resource "azurerm_resource_group" "rgtest1" {
name = "rgtest1"
location = "westeurope"
lifecycle {
prevent_destroy = true
}
}
resource "azurerm_management_lock" "resource-group-level" {
name = "resource-group-level"
scope = azurerm_resource_group.rgtest1.id
lock_level = "CanNotDelete"
notes = "This Resource Group should not be deleted"
}
Again, doing a general "terraform apply" will not prevent the deletion of the resource if the file is accidentally deleted. Is it possible to keep the lock configuration in another terraform workspace where both workspaces have the same backend and therefore the same terraform state terraform.tfstate?
3. Manually setting the resource lock in the Azure GUI/Console
Since there may be multiple resources and I want to keep it as simple as possible, I don't feel this is a really good solution.
4. terraform state rm
I don't like the solution because it should be possible to change the locked resources with Terraform sometimes. In addition, these would be manual commands. It would be better to have this within the Terraform files. Is this possible?
Question:
I must apologize for the complexity of the question. Does anyone have a suggestion as to which direction I should go to efficiently prevent accidental deletion (even with owner rights) in a team? Maybe a good "four eyes" principle?

What is the meaning of "authoritative" and "non-authoritative" for GCP IAM bindings/members

I am trying to understand the difference between google_service_account_iam_binding and google_service_account_iam_member in the GCP terraform provider at https://www.terraform.io/docs/providers/google/r/google_service_account_iam.html.
I understand that google_service_account_iam_binding is for granting a role to a list of members whereas google_service_account_iam_member is for granting a role to a single member, however I'm not clear on what is meant by "Authoritative" and "Non-Authoritative" in these definitions:
google_service_account_iam_binding: Authoritative for a given role. Updates the IAM policy to grant a role to a list of members. Other roles within the IAM policy for the service account are preserved.
google_service_account_iam_member: Non-authoritative. Updates the IAM policy to grant a role to a new member. Other members for the role for the service account are preserved.
Can anyone elaborate for me please?
"Authoritative" means to change all related privileges, on the other hand, "non-authoritative" means not to change related privileges, only to change ones you specified.
Otherwise, you can interpret authoritative as the single source of truth, and non-authoritative as a piece of truth.
This link helps a lot.
Basically it means: if a role is bound to a set of IAM identities and you want to add more identities, the authoritative one will require you to specify all the old identities again plus the new identities you want to add; otherwise any old identities you didn't specify will be unbound from the role.
It is quite close to the idea of a force push in git, because it overwrites whatever already exists. In our case, that is the identities.
Non-authoritative is the opposite:
You only need to care about the identity you are updating.
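For illustration, a minimal sketch contrasting the two (the service account and member names are hypothetical, and in practice you would not mix both forms for the same role):
resource "google_service_account" "example" {
  account_id   = "example-sa"
  display_name = "Example service account"
}

# Authoritative for this role on this service account: after apply, ONLY the
# listed members hold roles/iam.serviceAccountUser; anyone else is removed.
resource "google_service_account_iam_binding" "users" {
  service_account_id = google_service_account.example.name
  role               = "roles/iam.serviceAccountUser"

  members = [
    "user:alice@example.com",
    "user:bob@example.com",
  ]
}

# Non-authoritative: adds this single member to the role and leaves any
# members granted elsewhere untouched.
resource "google_service_account_iam_member" "carol" {
  service_account_id = google_service_account.example.name
  role               = "roles/iam.serviceAccountUser"
  member             = "user:carol@example.com"
}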
Authoritative resources may remove existing configurations and destroy your project, while non-authoritative ones do not.
The consequences of using an authoritative resource can be severely destructive. You may regret using them. Do not use them unless you are 100% confident that you must use authoritative resources.
Usability improvements for *_iam_policy and *_iam_binding resources #8354
I'm sure you know by now there is a decent amount of care required when using the *_iam_policy and *_iam_binding versions of IAM resources. There are a number of "be careful!" and "note" warnings in the resources that outline some of the potential pitfalls, but there are hidden dangers as well. For example, using the google_project_iam_policy resource may inadvertently remove Google's service agents' (https://cloud.google.com/iam/docs/service-agents) IAM roles from the project. Or, the dangers of using google_storage_bucket_iam_policy and google_storage_bucket_iam_binding, which may remove the default IAM roles granted to projectViewers:, projectEditors:, and projectOwners: of the containing project.
The largest issue I encounter with people running into the above situations is that the initial terraform plan does not show that anything is being removed. While the documentation for google_project_iam_policy notes that it's best to terraform import the resource beforehand, this is in fact applicable to all *_iam_policy and *_iam_binding resources. Unfortunately this is tedious, potentially forgotten, and not something that you can abstract away in a Terraform module.
See terraform/gcp - In what use cases we have no choice but to use authoritative resources? and reported issues.
A simple example. If you run this configuration, what do you think will happen? Do you think you can continue using your GCP project?
resource "google_service_account" "last_editor_standing" {
account_id = "last_editor_standing"
display_name = "last editor you will have after running this terraform"
}
resource "google_project_iam_binding" "last_editor_standing" {
project = "ToBeDemised"
members = [
"serviceAccount:${google_service_account.last_editor_standing.email}"
]
role = "roles/editor"
}
This will at least strip the roles/editor grant from the Google APIs Service Agent, which is essential to your project.
If you still think this is the type of resource to use, use it at your own risk.
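By contrast, a non-authoritative sketch of the same grant (reusing the hypothetical names above) only adds that one member and leaves the existing editors, including Google's service agents, in place:
resource "google_project_iam_member" "additional_editor" {
  project = "ToBeDemised"
  role    = "roles/editor"
  member  = "serviceAccount:${google_service_account.last_editor_standing.email}"
}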

How can one destroy terraform configuration in Azure efficiently?

I have an Azure terraform configuration. It sets up resource groups, key vaults, passwords, etc ...
When I destroy it terraform does the same in reverse - deleting secrets, access polices, key vaults and the last are resource groups.
But, if the resource groups are to be destroyed anyway, it makes sense just to destroy them first - all the child resources will be deleted automatically. But the azurerm provider does not do it this way.
What am I missing here? And if my understanding is correct, is there a way to implement it (without altering the provider, that is) ?
In Terraform's model, each resource is distinct. Although Terraform can see the dependencies you've defined or implied between them, it doesn't actually understand that e.g. a key vault is a child object of a resource group and so the key vault might be deleted as a side-effect of deleting the resource group.
With that said, unfortunately there is no built-in way in Terraform today to achieve the result you are looking for.
A manual approximation of the idea would be to use terraform state rm to tell Terraform to "forget" about each of the objects (that is, they will still exist in Azure but Terraform will have no record of them) that will eventually be destroyed as a side-effect of deleting the resource group anyway, and then running terraform destroy will only delete the resource group, because Terraform will believe that none of the other objects exist yet anyway. However, that is of course a very manual approach that, without some careful scripting, would likely take longer than just letting the Azure provider work through all of the objects in dependency order.
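As a side note, newer Terraform versions (1.7 and later) can express that "forget" step declaratively with removed blocks instead of running terraform state rm by hand; a minimal sketch, assuming a hypothetical key vault address:
removed {
  # Stop tracking this object without destroying it; it will then disappear
  # implicitly when its containing resource group is destroyed.
  from = azurerm_key_vault.example

  lifecycle {
    destroy = false
  }
}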
There is an exploratory issue in the Terraform repository that covers this use-case (disclaimer: I'm the author of that issue), but the Terraform team isn't actively working on that at the time I write this, because efforts are focused elsewhere. The current set of use-cases captured there doesn't quite capture your idea here of having Terraform recognize when it can skip certain destroy operations, so you might choose to share some details about your use-case on that issue to help inform potential future design efforts.
Terraform is built this way; it doesn't traverse the graph and understand that if the resource group is deleted, anything inside the resource group will be deleted as well - which isn't even true in some cases. So I would say it doesn't make sense to do that.
The only time this is really annoying is when you are testing. For that case you can, for example, create a script that initiates the resource group deletion and clears the local state.

Terraform and cleartext password in (remote) state file

There are many GitHub issues open on the Terraform repo about this, with lots of interesting comments, but as of now I still see no solution to it.
Terraform stores plain text values, including passwords, in tfstate files.
Most users need to store state remotely so the team can work concurrently on the same infrastructure, and most of them store the state files in S3.
So how do you hide your passwords?
Is there anyone here using Terraform for production? Do you keep your passwords in plain text?
Do you have a special workflow to remove or hide them? What happens when you run a terraform apply then?
I've considered the following options:
store them in Consul - I don't use Consul
remove them from the state file - this requires another process to be executed each time and I don't know how Terraform will handle the resource with an empty/unreadable/not working password
store a default password that is then changed (so Terraform will have a not working password in the tfstate file) - same as above
use the Vault resource - sounds it's not a complete workflow yet
store them in Git with git-repo-crypt - Git is not an option either
globally encrypt the S3 bucket - this will not prevent people from seeing plain text passwords if they have "manager"-level access to AWS, but it seems to be the best option so far
From my point of view, this is what I would like to see:
state file does not include passwords
state file is encrypted
passwords in the state file are "pointers" to other resources, like "vault:backend-type:/path/to/password"
each Terraform run would gather the needed passwords from the specified provider
This is just a wish.
But to get back to the question - how do you use Terraform in production?
I would also like to know the best practice here, but let me share my case, although it is limited to AWS. Basically, I do not manage credentials with Terraform.
I set an initial password for RDS, ignore the difference with a lifecycle block, and change it later. The way to ignore the difference is as follows:
resource "aws_db_instance" "db_instance" {
...
password = "hoge"
lifecycle {
ignore_changes = ["password"]
}
}
IAM users are managed by Terraform, but IAM login profiles, including passwords, are not. I believe that IAM passwords should be managed by individuals and not by the administrator.
API keys used by applications are also not managed by Terraform. They are encrypted with AWS KMS (Key Management Service) and the encrypted data is saved in the application's git repository or an S3 bucket. The advantage of KMS encryption is that decryption permissions can be controlled by an IAM role; there is no need to manage keys for decryption.
Although I have not tried it yet, I recently noticed that aws ssm put-parameter --key-id can be used as a simple key-value store supporting KMS encryption, so this might be a good alternative as well.
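For example, a hedged sketch of that combination (the key, policy name, parameter path and account ID are all hypothetical): Terraform manages the KMS key and the permission to read the parameters, while the secret values themselves are written with aws ssm put-parameter and fetched by the application at runtime, so they never pass through Terraform or its state.
resource "aws_kms_key" "app_secrets" {
  description = "Encrypts application secrets stored in SSM Parameter Store"
}

resource "aws_iam_policy" "read_app_secrets" {
  name = "read-app-secrets" # hypothetical name

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = "ssm:GetParameter"
        Resource = "arn:aws:ssm:eu-west-1:123456789012:parameter/myapp/*" # hypothetical path
      },
      {
        Effect   = "Allow"
        Action   = "kms:Decrypt"
        Resource = aws_kms_key.app_secrets.arn
      },
    ]
  })
}

# Attach aws_iam_policy.read_app_secrets to the application's role or instance
# profile; the application then reads the /myapp/* parameters itself at runtime.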
I hope this helps you.
The whole remote state stuff is being reworked for 0.9, which should open things up for locking of remote state and potentially encrypting the whole state file and/or just the secrets.
Until then we simply use multiple AWS accounts and write state for the stuff that goes into each account into an S3 bucket in that account. In our case we don't really care too much about the secrets that end up in there, because if you have access to read the bucket then you normally have a fair amount of access in that account anyway. Plus, our only real secrets kept in state files are RDS database passwords, and we restrict access at the security group level to just the application instances and the Jenkins instances that build everything, so there is no direct access from the command line on people's workstations anyway.
I'd also suggest adding encryption at rest on the S3 bucket (just because it's basically free) and versioning so you can retrieve older state files if necessary.
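A hedged sketch of that setup with current AWS provider versions (v4+), assuming a hypothetical bucket name; the state bucket itself is usually managed in a separate bootstrap configuration rather than the one that uses it as a backend:
terraform {
  backend "s3" {
    bucket  = "my-tf-state-bucket" # hypothetical
    key     = "prod/terraform.tfstate"
    region  = "eu-west-1"
    encrypt = true # server-side encrypt the state object
  }
}

resource "aws_s3_bucket_versioning" "state" {
  bucket = "my-tf-state-bucket"

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "state" {
  bucket = "my-tf-state-bucket"

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}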
To take it further, if you are worried about people with read access to the S3 buckets containing your state, you could add a bucket policy that explicitly denies access to anyone other than some whitelisted roles/users, which would then be taken into account above and beyond any IAM access. Extending the example from a related AWS blog post, we might have a bucket policy that looks something like this:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::MyTFStateFileBucket",
        "arn:aws:s3:::MyTFStateFileBucket/*"
      ],
      "Condition": {
        "StringNotLike": {
          "aws:userId": [
            "AROAEXAMPLEID:*",
            "AIDAEXAMPLEID"
          ]
        }
      }
    }
  ]
}
Where AROAEXAMPLEID represents an example role ID and AIDAEXAMPLEID represents an example user ID. These can be found by running:
aws iam get-role --role-name ROLE-NAME
and
aws iam get-user --user-name USER-NAME
respectively.
If you really want to go down the route of fully encrypting the state file, then you'd need to write a wrapper script that makes Terraform interact with the state file locally (rather than remotely) and then have your wrapper script manage the remote state, encrypting it before it is uploaded to S3 and decrypting it as it's pulled down.
