I have a Terraform script that, after creating several nodes, installs a piece of software on them by telling one node about the other ones using cloud-init. I want this cloud-init piece to run only when this one node is initially created, and not if it's altered or re-created. Thanks to terraform plan, Terraform already has the information needed: it clearly tells the user whether the node has to be re-created or is being created for the first time. But how do I get that kind of information into my script?
The boring, manual way is of course to make it a variable that a human enters after reviewing the plan. But that is hardly scalable, secure or sophisticated.
Maybe Terraform is entirely the wrong tool for this kind of job?
Terraform does not expose the information about what action is planned for an object for use in the configuration itself, because Terraform is a desired state system and so the actions are derived from the configuration, rather than the configuration being derived from the actions.
To achieve what you described, I think you'll need to arrange for the initialization process itself to remember, in some persistent location, that it has already run. Cloud-init itself records when it has run to completion so that rebooting the system won't re-run initialization tasks, but of course that information cannot survive replacing the VM entirely, so you'd need to create a similar marker yourself in a data store that will outlive that particular VM.
One unanswered question down that path is how that external state would get cleaned up if you were to destroy the VM entirely, since the software running in the VM can't tell whether the system is being shut down in preparation for replacement or simply being destroyed.
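As a rough illustration of that idea only, here is a minimal sketch of user_data that consults an external marker before running the one-time join step and records completion afterwards. The bucket name, script path and variables are hypothetical, and it assumes the instance has credentials to reach S3:

resource "aws_instance" "cluster_node" {
  ami           = var.ami_id
  instance_type = "t3.medium"

  user_data = <<-EOT
    #!/bin/bash
    # The marker lives outside the VM, so it survives the VM being replaced.
    MARKER="s3://${var.state_bucket}/cluster-initialized"
    if aws s3 ls "$MARKER" >/dev/null 2>&1; then
      echo "cluster already initialized; skipping one-time setup"
      exit 0
    fi
    # Hypothetical one-time step that tells this node about the others.
    /opt/setup/join-cluster.sh ${join(",", var.other_node_ips)}
    # Record completion in the external store.
    echo done | aws s3 cp - "$MARKER"
  EOT
}

The cleanup concern mentioned above still applies: destroying the VM for good would leave the marker behind unless something else removes it.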
Related
I have a module that creates some VMs (along with a VApp) and I need to enable users of that module to pass scripts that would get run on creation and/or deletion of each VM. For example, to enroll the VM into a company authentication system, CMDB, monitoring system or other.
And I just cannot find a way to implement it nicely.
There are several approaches that I've tried:
Passing a script directly to commands inside provisioner blocks on each VM resource.
Advantages of this approach:
Clean and simple - there are no additional resources.
Works as intended - runs on creation and/or deletion.
The problems with this approach:
You can only pass one script per VM.
You could concatenate them together, but that has obvious drawbacks: a failure in one becomes a failure in all of them, leaving things in an inconsistent state, and retrying gets pretty messy.
It's somewhat ugly since if the command is null, Terraform crashes.
As a workaround, you could pass a one-liner containing only a comment about why it's there but that doesn't exactly make it pretty.
Creating a separate null_resource for each VM-script combination and putting the script there.
Advantages of this approach:
There is a separate resource for each script for each VM that if failed could be retried.
The problems with this approach:
Problems arise when updating the scripts.
Creation script updates can be ignored with a lifecycle ignore statement, since if you're changing the enroll procedure after a machine has already been successfully enrolled, you don't need to enroll it again.
But changing a when = destroy script results in the resource being recreated, meaning it is destroyed and created again with the new version of the script. That in turn triggers a run of the old destroy script, which is very unlikely to be desirable.
Creating "runner" null_resources that only contain the name of a script to run. The script and its content is managed through a set of local_file resources.
Advantages of this approach:
Script content is separated from script execution.
The problems with this approach:
This falls apart in ephemeral environments (like running Terraform in a new container each time), where the files would have to be created each time. Maybe this can be worked around, though it seems difficult because depends_on fails if anything other than a static resource reference is used; it fails when trying to use resource subscripts.
Additional considerations:
Pushing those scripts to the VMs and having them run there is undesirable because:
Additional risk of having to handle secrets passed to a lot of machines.
It assumes the machines can also reach the relevant APIs, which is not always true.
The hack of putting information from other resources in triggers (because a when = destroy provisioner can only use references to self) is a bit worrisome, because relying on it means everything will break completely if the Terraform developers decide to remove it entirely. A sketch of the null_resource pattern, including this triggers workaround, follows.
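For concreteness, here is a minimal sketch of the second approach combined with the triggers workaround; the variable names and scripts are hypothetical, not part of the actual module:

resource "null_resource" "enroll" {
  for_each = var.vm_ids # hypothetical map of VM name => VM ID from the module

  triggers = {
    vm_id          = each.value
    destroy_script = var.unenroll_script # stashed here so the destroy provisioner can reach it via self
  }

  provisioner "local-exec" {
    command = "${var.enroll_script} ${each.value}"
  }

  provisioner "local-exec" {
    when    = destroy
    command = "${self.triggers.destroy_script} ${self.triggers.vm_id}"
  }

  lifecycle {
    # Ignoring trigger changes avoids re-running enrollment when the scripts
    # change, but it also means an updated unenroll script is never picked up,
    # which is exactly the trade-off described above.
    ignore_changes = [triggers]
  }
}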
I have implemented a Kubernetes Terraform provider which applies manifest files to a k8s cluster. I have created the .tf files as well, but when I run terraform init it downloads plugins from the Terraform Registry.
How can I make terraform apply run my plugin?
Debugging a Terraform provider
Understanding the design
In order to do that, you first have to understand how Go builds apps, and then how Terraform works with them.
Every Terraform provider is a sort of module. In order to support an open, modular system, in almost any language, you need to be able to dynamically load modules and interact with them. Terraform is no exception.
However, the Go team long ago decided to compile to statically linked applications; any dependencies you have are compiled into a single binary. Unlike other native languages (like C or C++), a .dll or .so is not used; there is no dynamic library to load at runtime, and thus modularity becomes a whole other trick. This was done to avoid the notorious DLL hell that was so common until most modern systems included some kind of dependency management. And yes, it can still be an issue.
Every Terraform provider is its own mini RPC server. When Terraform runs your provider, it actually starts a new process that is your provider and connects to it through this RPC channel. Compounding the problem, the lifetime of your provider process is very much ephemeral, potentially lasting no more than a few seconds. It's this process you need to connect to with your debugger.
Normal debugging
Normally, you would spin up your app directly, and it would load modules into application memory. That's why you can actually debug it: your debugger knows how to find the exact memory address of your provider. However, you don't have this arrangement here, so you need to do a remote debug session.
The conundrum
So, you don't load Terraform directly, and even if you did, your module (a.k.a. your provider) is in the memory space of an entirely different process, one that potentially lasts no more than a few seconds.
The solution
You need the debugging tool delve.
You are going to have to place a little bit of shim code close to the spot where you want to begin debugging. We need to stop the provider process from exiting before we can connect, so put this bit of code in place:
connected := false
for !connected {
    time.Sleep(time.Second) // set breakpoint here
}
This code effectively creates an infinite sleep loop, but that's actually essential to solving the problem. Place a breakpoint right inside this loop; it won't do anything yet.
Now run whatever terraform commands you need to engage the code you want to debug. Upon doing so, Terraform will basically stop as it waits for a response from your provider, because you put an infinite sleep loop in it.
You must now tell delve to connect to this remote process using its PID. This isn't as hard as it seems. Run this command:
dlv --listen=:2345 --headless=true --api-version=2 --accept-multiclient attach $(pgrep terraform-provider-artifactory)
The last argument gets the PID of your provider and supplies it to delve to connect. Immediately upon running this command, you're going to hit your breakpoint. Please make sure to substitute your provider's name for terraform-provider-artifactory.
To exit this infinite loop, use your debugger to set connected to true. By doing so you change the loop predicate
and it will exit this loop on the next iteration.
DEBUG! At this point you can step, watch, inspect the call stack, etc. Your whole arsenal is available.
Is it possible to fail an aws_instance creation if script passed into user_data fails to run? e.g., with exit 1?
I have a null_resource that uses depends_on [aws_instance.myVM] to post a JSON to the instance, and I need that depends_on to fail when the user_data script fails.
Thanks!
user_data is handled by software running inside the instance itself, such as cloud-init, and so its processing is asynchronous from the ec2:RunInstances call that Terraform's AWS provider makes to start the instance running. There is therefore no way to feed back status information from the user_data handling because it could potentially be running some time after the EC2 API starts reporting that the instance is "running", depending on where in the boot process it's dealt with.
Also, depends_on is for ordering rather than for handling errors, so a depends_on clause will never change anything about Terraform's error handling.
If you want to run software on your instance synchronously as part of Terraform's create operation then unfortunately the only practical option is to use the remote-exec provisioner. This comes at the expense of considerable extra complexity, because Terraform must now be able to open an SSH session with the instance and authenticate with it to create a two-way communications channel.
In return for that complexity, Terraform can be the one to run the code in question and so Terraform can detect whether it succeeded or not (using its exit status). If the remote command fails, Terraform will halt further processing, return an error, and mark the instance as tainted so that the next plan will attempt to create it again.
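As a rough sketch of what that looks like (the AMI, key, and script path here are placeholders, and it assumes key-based SSH access to the instance):

resource "aws_instance" "myVM" {
  ami           = var.ami_id
  instance_type = "t3.micro"
  key_name      = var.key_name

  connection {
    type        = "ssh"
    host        = self.public_ip
    user        = "ec2-user"
    private_key = file(var.private_key_path)
  }

  provisioner "remote-exec" {
    # If any of these commands exits with a non-zero status, Terraform reports
    # an error and marks the instance as tainted.
    inline = [
      "sudo /opt/bootstrap/configure.sh",
    ]
  }
}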
With that said, it may be better to find a different way to achieve your goal that doesn't require coupling the Terraform run directly to the state of the software running in the virtual machine. It's not usually expected that a Terraform configuration should need to both start up a virtual machine and interact with software running in that virtual machine in the same operation. This could be a good point for some system decomposition, where you'd have one Terraform configuration that declares that the virtual machine should exist and then a second configuration that assumes that the virtual machine exists and takes actions against it. You can then run whatever other software you need to run in between those two, in order to check whether the instance started up successfully.
One not-so-obvious way I am doing this: I register the instance as a Consul service at the end of user_data. If the service never appears, downstream polling for its arrival can time out and fail. That polling could live in another local-exec provisioner that is used as a dependency, or it could be in another instance's user_data as well.
Perhaps a better way would be to use the Consul KV store, though, and not just rely on service arrival. The KV store can provide more information than a simple binary state.
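A hypothetical sketch of that polling step (the key name, the timeout, and the presence of the consul binary on the machine running Terraform are all assumptions):

resource "null_resource" "wait_for_bootstrap" {
  depends_on = [aws_instance.myVM]

  provisioner "local-exec" {
    command = <<-EOT
      # Poll for up to ~10 minutes for the marker written at the end of user_data.
      for attempt in $(seq 1 60); do
        if consul kv get "bootstrap/${aws_instance.myVM.id}" >/dev/null 2>&1; then
          exit 0
        fi
        sleep 10
      done
      echo "instance never reported a successful bootstrap" >&2
      exit 1
    EOT
  }
}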
I usually run all my Terraform scripts from a bastion server, and all my code including the tf statefile resides on the same server. There was an incident where my machine accidentally went down (hard reboot) and the root filesystem got corrupted. Now my statefile is gone but my resources still exist and are running. I don't want to run terraform apply again and recreate the whole environment with downtime. What's the best way to recover from this mess, and what can be done so that this doesn't get repeated in future?
I have already taken a look at terraform refresh and terraform import. But are there any better ways to do this?
and all my code including the tf statefile resides on the same server.
As you don't have a .backup file, I'm not sure you can recover the statefile smoothly in a Terraform-native way; do let me know if you find a way :). However, there are a few steps you can take that will help you get out of situations like this.
The best practice is to keep all your statefiles in remote storage such as S3 or Azure Blob Storage and configure your backend accordingly, so that every time you create or destroy a stack, Terraform always reads and writes the statefile remotely.
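For example, a typical S3 backend block looks like this (the bucket, key and table names are placeholders):

terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks" # optional, enables state locking
    encrypt        = true
  }
}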
On top of that, you can take advantage of Terraform workspaces to avoid statefile mess in multi-environment scenarios. Also consider creating a plan for backtracking and versioning of previous deployments, e.g.:
terraform plan -var-file "" -out "" -target=module.<blue/green>
what can be done so that this doesn't get repeated in future.
Terraform blue-green deployment is the answer to your question. We implemented this model quite a while ago and it's running smoothly. The whole idea is modularity and reusability: the same templates work for 5 different components with different architectures, without any downtime (the core template remains the same; only the variable files differ).
We take advantage of Terraform modules. We have two modules, called blue and green; you can name them anything. At any given point in time either blue or green is taking traffic. If we have changes to deploy, we bring up the alternative stack based on the state output (the targeted module, based on the Terraform state), auto-validate it, then move the traffic to the new stack and destroy the old one.
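Very roughly, the layout looks something like this (the module and variable names are just placeholders); combined with targeted plan/apply runs like the command shown above, one side can be brought up or torn down independently:

module "blue" {
  source = "./modules/stack"
  ami_id = var.blue_ami_id
  weight = var.blue_traffic_weight # drives DNS/load-balancer weighting inside the module
}

module "green" {
  source = "./modules/stack"
  ami_id = var.green_ami_id
  weight = var.green_traffic_weight
}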
Here is an article you can keep as a reference. It doesn't exactly reflect what we do, but it's a good place to start.
Please see this blog post, which, unfortunately, shows that import is the only solution.
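For reference, import is done one resource at a time; the resource address and instance ID below are hypothetical:
terraform import aws_instance.myVM i-0123456789abcdef0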
If you are still unable to recover the Terraform state, you can create a blueprint of the Terraform configuration as well as the state for specific AWS resources using terraforming, but it requires some manual effort to edit the state so the resources are managed again. You can then take that state file, run terraform plan, and compare its output with your infrastructure. It is good to have remote state, especially in an object store like AWS S3 or a key-value store like Consul; both support locking the state when multiple operations happen at the same time, and the backup process is also quite simple.
I'm using the ibm_installation_manager module from the Puppet Forge, and it is a bit basic because IBM wrote Installation Manager at a time when idempotency wasn't much of a consideration.
ref: https://forge.puppet.com/puppetlabs/ibm_installation_manager
As such it does not cater nicely for upgrades: the module will not detect that an upgrade is needed, stop the existing processes, do the upgrade, and then start the processes again. It will just detect whether the desired version needs installing and try to install it; if that constitutes an upgrade, that's great, but it will probably fail due to the running instances.
So I need to implement some "stop processes" pre-upgrade functionality.
I need to mention at this point I'm new to ruby and fairly new to puppet.
The provider that the module uses (imcl.rb) has an exists method.
The ideal way for me to detect if an upgrade is going to happen (and stop the instances if it is) would be for my puppet manifest to be able to somehow call the exists method. Is this possible?
Or how would you approach this problem?
Something like imcl.exists(ibm_pkg["my_imcl_pkg_resource"])
The ideal way for me to detect if an upgrade is going to happen (and stop the instances if it is) would be for my puppet manifest to be able to somehow call the exists method. Is this possible?
No, it is not possible, at least not in any useful way. Your manifests describe how to build a catalog of resources describing the target state of the machine. In a master / agent setup, this happens on the master. The catalog is then used as input to a separate step, in which it is transferred to the target machine and applied there. It is in this second step that providers are engaged.
To the extent that you want the contents of your catalogs to be influenced by the current state of the target machine, the Puppet mechanism for that is to convey the needed state details to the catalog builder in the form of facts. It is relatively straightforward to add your own facts. Indeed, there are at least two distinct, non-exclusive mechanisms, going under the names "external facts" and "custom facts".