Capture Terraform provisioner output? - terraform

Use Case
Trying to provision a (Docker Swarm or Consul) cluster where initializing the cluster first occurs on one node, which generates some token, which then needs to be used by other nodes joining the cluster. Key thing being that nodes 1 and 2 shouldn't attempt to join the cluster until the join key has been generated by node 0.
E.g. on node 0, running docker swarm init ... will return a join token. Then on nodes 1 and 2, you'd need to pass that token to the join command, like docker swarm join --token ${JOIN_TOKEN} ${NODE_0_IP_ADDRESS}:${SOME_PORT}. And magic, you've got a neat little cluster...
Attempts So Far
Tried initializing all nodes with the AWS SDK installed, and storing the join key from node 0 on S3, then fetching that join key on other nodes. This is done via a null_resource with 'remote-exec' provisioners. Because Terraform executes things in parallel, there are race conditions, and predictably nodes 1 and 2 frequently attempt to fetch a key from S3 that's not there yet (e.g. node 0 hasn't finished its work yet).
Tried using the 'local-exec' provisioner to SSH into node 0 and capture its join key output. This hasn't worked well or I sucked at doing it.
I've read the docs. And Stack Overflow. And GitHub issues, like this really long outstanding one. Thoroughly. If this has been solved elsewhere though, links appreciated!
PS - this is directly related to and is a smaller subset of this question, but wanted to re-ask it in order to focus the scope of the problem.

You can redirect the outputs to a file:
resource "null_resource" "shell" {
provisioner "local-exec" {
command = "uptime 2>stderr >stdout; echo $? >exitstatus"
}
}
and then read the stdout, stderr and exitstatus files with local_file.
The problem is that if the files disappear, then terraform apply will fail.
In Terraform 0.11 I made a workaround by reading the files with an external data source and storing the results in a null_resource's triggers (!):
resource "null_resource" "contents" {
triggers = {
stdout = "${data.external.read.result["stdout"]}"
stderr = "${data.external.read.result["stderr"]}"
exitstatus = "${data.external.read.result["exitstatus"]}"
}
lifecycle {
ignore_changes = [
"triggers",
]
}
}
But in 0.12 this can be replaced with file().
Finally, I can use or output those values with:
output "stdout" {
value = "${chomp(null_resource.contents.triggers["stdout"])}"
}
See the module https://github.com/matti/terraform-shell-resource for full implementation
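As a rough illustration of the redirect approach above, the local-exec command could call a small wrapper instead of spelling out the redirections inline. This is only a minimal sketch and not the linked module's implementation; the function name capture and the output directory layout are assumptions made for the example.

```shell
#!/bin/sh
# capture DIR CMD [ARGS...]
# Runs CMD and stores its stdout, stderr and exit status as files in DIR.
# Returns 0 even when CMD fails, so the status can be inspected later
# instead of aborting the provisioner.
capture() {
    dir=$1
    shift
    mkdir -p "$dir" || return 1
    status=0
    "$@" >"$dir/stdout" 2>"$dir/stderr" || status=$?
    echo "$status" >"$dir/exitstatus"
}
```

A call such as capture ./out uptime would then leave ./out/stdout, ./out/stderr and ./out/exitstatus behind for Terraform to read.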

You can use external data:
data "external" "docker_token" {
program = ["/bin/bash", "-c", "echo \"{\\\"token\\\":\\\"$(docker swarm init...)\\\"}\""]
}
Then the token will be available as data.external.docker_token.result.token.
If you need to pass arguments in, you can use a script (e.g. relative to path.module). See https://www.terraform.io/docs/providers/external/data_source.html for details.
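If the one-liner gets unwieldy, the quoting is easier to manage in a script. The external data source expects the program to print a single JSON object with string values on stdout, and the sketch below builds that object. The helper names, and the ssh user and host in the usage comment, are hypothetical.

```shell
#!/bin/sh
# json_escape STRING
# Escapes backslashes and double quotes so STRING is safe inside a JSON string.
json_escape() {
    printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# token_json TOKEN
# Prints the JSON object that the external data source reads from stdout.
token_json() {
    printf '{"token":"%s"}\n' "$(json_escape "$1")"
}

# Hypothetical usage from the machine running Terraform, once node 0 has
# initialized the swarm (user and host are assumptions):
#   token_json "$(ssh ec2-user@node0 docker swarm join-token -q worker)"
```

Keeping the JSON construction in one function means the token never has to survive several layers of shell and HCL escaping.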

When I asked myself the same question, "Can I use output from a provisioner to feed into another resource's variables?", I went to the source for answers.
At this moment in time, provisioner results are simply streamed to terraform's standard out and never captured.
Given that you are running remote provisioners on both nodes, and you are trying to access values from S3 - I agree with this approach by the way, I would do the same - what you probably need to do is handle the race condition in your script with a sleep command, or by scheduling a script to run later with the at or cron or similar scheduling systems.
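A retry loop is usually more robust than a single fixed sleep. The following is a minimal sketch of that idea for the remote-exec script on nodes 1 and 2; the function name retry and the bucket path in the usage note are assumptions.

```shell
#!/bin/sh
# retry MAX_ATTEMPTS DELAY_SECONDS CMD [ARGS...]
# Re-runs CMD until it succeeds or MAX_ATTEMPTS is reached, sleeping
# DELAY_SECONDS * attempt between tries (a simple linear backoff).
retry() {
    max=$1
    delay=$2
    shift 2
    attempt=1
    while ! "$@"; do
        if [ "$attempt" -ge "$max" ]; then
            echo "giving up after $attempt attempts" >&2
            return 1
        fi
        sleep $((delay * attempt))
        attempt=$((attempt + 1))
    done
}
```

On a joining node this could be used as retry 8 2 aws s3 cp s3://my-bucket/join-token /tmp/join-token, so the script only proceeds once node 0 has uploaded the key.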
In general, Terraform wants to access all variables either up front, or as the result of a provider. Provisioners are not necessarily treated as first-class in Terraform. I'm not on the core team so I can't say why, but my speculation is that it reduces complexity to ignore provisioner results beyond success or failure, since provisioners are just scripts so their results are generally unstructured.
If you need more enhanced capabilities for setting up your instances, I suggest a dedicated tool for that purpose like Ansible, Chef, Puppet, etc. Terraform's focus is really on Infrastructure, rather than software components.

A simpler solution would be to provide the token yourself.
When creating the ACL token, simply pass in the ID value and Consul will use that instead of generating one at random.

You could effectively run the docker swarm init step for node 0 as a Terraform External Data Source, and have it return JSON. Make the provisioning of the remaining nodes depend on this step and refer to the join token generated by the external data source.
https://www.terraform.io/docs/providers/external/data_source.html

With resource dependencies you can ensure that a resource is created before another.
Here's an incomplete example of how I create my consul cluster, just to give you an idea.
resource "aws_instance" "consul_1" {
user_data = <<EOF
#cloud-config
runcmd:
- 'docker pull consul:0.7.5'
- 'docker run -d -v /etc/localtime:/etc/localtime:ro -v $(pwd)/consul-data:/consul/data --restart=unless-stopped --net=host consul:0.7.5 agent -server -advertise=${self.private_ip} -bootstrap-expect=2 -datacenter=wordpress -log-level=info -data-dir=/consul/data'
EOF
}
resource "aws_instance" "consul_2" {
depends_on = ["aws_instance.consul_1"]
user_data = <<EOF
#cloud-config
runcmd:
- 'docker pull consul:0.7.5'
- 'docker run -d -v /etc/localtime:/etc/localtime:ro -v $(pwd)/consul-data:/consul/data --restart=unless-stopped --net=host consul:0.7.5 agent -server -advertise=${self.private_ip} -retry-join=${aws_instance.consul_1.private_ip} -datacenter=wordpress -log-level=info -data-dir=/consul/data'
EOF
}
For the Docker Swarm setup, I think it's out of Terraform's scope, and I think it should be, because the token isn't an attribute of the infrastructure you are creating. So I agree with nbering: you could try to achieve that setup with a tool like Ansible or Chef.
But anyway, if the example helps you set up your Consul cluster, I think you just need to configure Consul as your Docker Swarm backend.

Sparrowform is a lightweight provisioner for Terraform-based infrastructure that can handle your case. Here is an example for AWS EC2 instances.
Assume we have 3 EC2 instances for the cluster: node0, node1 and node2. The first one (node0) is where we fetch the token from and store it in an S3 bucket. The other two load the token later from S3.
$ nano aws_instance.node0.sparrowfile
#!/usr/bin/env perl6
# have not checked this command, but that's the idea ...
bash "docker swarm init | aws s3 cp - s3://alexey-bucket/stream.txt"
$ nano aws_instance.node1.sparrowfile
#!/usr/bin/env perl6
my $i=0;
my $token;
try {
while True {
my $s3-token = run 'aws', 's3', 'cp', 's3://alexey-bucket/stream.txt', '-', :out;
$token = $s3-token.out.lines[0];
$s3-token.out.close;
last if $i++ > 8 or $token;
say "retry num $i ...";
sleep 2*$i;
}
CATCH { { .resume } }
}
die "failed to fetch the token" unless $token;
bash "docker swarm init $token";
$ nano aws_instance.node2.sparrowfile - the same setup as for node1
$ terraform apply # bootstrap infrastructure
$ sparrowform --ssh_private_key=~/.ssh/aws.pub --ssh_user=ec2-user # run provisioning on node0, node1, node2
PS disclosure, I am the tool author.

Related

Capturing kubectl set command in terraform

We have a case where we need to update AWS EKS CNI config on the daemon set. But the solution is only through kubectl command. How do we update an existing daemonset with specific values through terraform code? The requirement is that the solution has to be in IAC. The equivalent kubectl command given is
kubectl set env daemonset -n kube-system aws-node WARM_IP_TARGET=2 MINIMUM_IP_TARGET=12
The values shown in numbers are planned to be variables in terraform.
What you are asking for doesn't exist. Here is the open Terraform Github issue for what you are asking for:
https://github.com/hashicorp/terraform-provider-kubernetes/issues/723
Even if that did exist, I wouldn't consider that IaC as it's not declarative (might as well just run a bash script).
In my opinion, the real solution is for AWS to allow the provisioning of bare clusters so that "addons" can be managed completely through IaC tools. But that also does not exist:
https://github.com/aws/containers-roadmap/issues/923
The closest you're going to get will be to use a null_resource to execute the patch. Here's an example in that Github issue:
https://github.com/hashicorp/terraform-provider-kubernetes/issues/723#issuecomment-679423792
So your final result will look similar to this:
resource "null_resource" "patch_aws_cni" {
triggers = {
always_run = timestamp()
}
provisioner "local-exec" {
command = <<EOF
# do all those commands to get kubectl and auth info, then run:
kubectl set env daemonset -n kube-system aws-node WARM_IP_TARGET=2 MINIMUM_IP_TARGET=12
EOF
}
}
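Since the numbers are meant to come from Terraform variables, the command inside the heredoc can be wrapped so the values arrive as arguments (for instance via the local-exec provisioner's environment argument). This is a sketch under those assumptions; the function name patch_aws_cni and the KUBECTL override are not part of the linked issue. Note that kubectl set env takes space-separated KEY=VALUE pairs.

```shell
#!/bin/sh
# patch_aws_cni WARM_IP_TARGET MINIMUM_IP_TARGET
# Sets both variables on the aws-node daemonset in kube-system.
# KUBECTL may point at a wrapper that injects auth info; it defaults to kubectl.
patch_aws_cni() {
    "${KUBECTL:-kubectl}" set env daemonset -n kube-system aws-node \
        "WARM_IP_TARGET=$1" "MINIMUM_IP_TARGET=$2"
}
```

With the two values exported by Terraform, the provisioner command reduces to something like patch_aws_cni "$WARM_IP_TARGET" "$MINIMUM_IP_TARGET".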

Azure + Terraform + Grabbing a variable and passing it along

Hopefully someone can push me in the right direction.
I have a Terraform plan that currently stands up a Linux VM in Azure. I am attempting to run a bash script to install a software client.
It appears the azurerm provider does not support user_data; rather, it supports custom_data. Am I correct in this statement?
That being said, what I am trying to do as well is this.
Reference instance of the software client is setup within a web portal. The web portal creates a token for this reference instance. That token is then used when installing the software client within the Linux VM.
My code for running the bash script is as follows:
custom_data = <<USERDATA
#!/bin/bash -xe
curl -J -O -L https://app.strongdm.com/releases/cli/linux && unzip sdmcli* && rm -f sdmcli*
sudo ./sdm install --relay
USERDATA
I get an error however when running terraform apply
$ terraform apply
Error: expected "custom_data" to be a base64 string, got
#!/bin/bash -xe
curl -J -O -L https://app.strongdm.com/releases/cli/linux && unzip sdmcli* && rm -f sdmcli*
sudo ./sdm install --relay
Here are my questions:
Would I use something like a key vault to hold that token and then pull it when the bash script runs?
Is there a better way of passing that token?
Can you pass things of that nature in terraform like variables?
Am I trying to run my bash script in the correct place?
The bash script is being run within the resource "azurerm_linux_virtual_machine" "tfssh1" { block of code.
UPDATE
Log into Admin UI
Create a new instance - Token is generated. Copy this token and save it somewhere.
Run install on Linux VM
Installation prompts for token saved earlier
Token is input
Installation completes
Admin UI now knows this server matches this instance
I just found this within the provider
output "gateway_token" {
value = sdm_node.my_gateway.gateway[0].token
sensitive = true
}
That is outputting the token in question. I should be able to grab that within my bash script. Now I just need to figure out the correct way to run said bash script through terraform.
The custom_data argument expects a base64-encoded string, so you need to wrap the content in base64encode:
custom_data = base64encode(data.local_file.cloudinit.content)
data "local_file" "cloudinit" {
filename = "${path.module}/cloud-init.conf"
}

Check if Kubernetes deployment was successful in CI/CD pipeline

I have an AKS cluster with Kubernetes version 1.14.7.
I have built CI/CD pipelines to deploy newly created images to the cluster.
I am using kubectl apply to update a specific deployment with the new image. Sometimes, for a variety of reasons, the deployment fails, for example with ImagePullBackOff.
Is there a command to run after kubectl apply to check whether the pod creation and deployment were successful?
For this purpose Kubernetes has kubectl rollout, and you should use its status subcommand.
By default 'rollout status' will watch the status of the latest rollout until it's done. If you don't want to wait for the rollout to finish then you can use --watch=false. Note that if a new rollout starts in-between, then 'rollout status' will continue watching the latest revision. If you want to pin to a specific revision and abort if it is rolled over by another revision, use --revision=N where N is the revision you need to watch for.
You can read the full description here
If you use kubectl apply -f myapp.yaml and check the rollout status, you will see:
$ kubectl rollout status deployment myapp
Waiting for deployment "myapp" rollout to finish: 0 of 3 updated replicas are available…
Waiting for deployment "myapp" rollout to finish: 1 of 3 updated replicas are available…
Waiting for deployment "myapp" rollout to finish: 2 of 3 updated replicas are available…
deployment "myapp" successfully rolled out
There is another way to wait for deployment to become available with a configured timeout like
kubectl wait --for=condition=available --timeout=60s deploy/myapp
Otherwise kubectl rollout status can be used, but in some rare cases it may get stuck forever and will require manual cancellation of the pipeline if that happens.
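In a pipeline, the two ideas can be combined: kubectl rollout status also accepts a --timeout flag, so the step can fail with a non-zero exit code rather than hang. A minimal sketch follows; the function name, the default timeout and the KUBECTL override are assumptions.

```shell
#!/bin/sh
# wait_for_rollout DEPLOYMENT [TIMEOUT]
# Returns 0 if the rollout finishes within TIMEOUT (default 120s) and
# non-zero otherwise, so the CI job fails instead of waiting indefinitely.
wait_for_rollout() {
    deployment=$1
    timeout=${2:-120s}
    if "${KUBECTL:-kubectl}" rollout status "deployment/$deployment" --timeout="$timeout"; then
        echo "rollout of $deployment succeeded"
    else
        echo "rollout of $deployment failed or timed out" >&2
        return 1
    fi
}
```

Running wait_for_rollout myapp 60s right after kubectl apply gives the pipeline a clear pass or fail signal.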
You can parse the output through jq:
kubectl get pod -o=json | jq '.items[]|select(any( .status.containerStatuses[]; .state.waiting.reason=="ImagePullBackOff"))|.metadata.name'
It looks like kubediff tool is a perfect match for your task:
Kubediff is a tool for Kubernetes to show you the differences between your running configuration and your version controlled configuration.
The tool can be used from the command line and as a Pod in the cluster that continuously compares YAML files in the configured repository with the current state of the cluster.
$ ./kubediff
Usage: kubediff [options] <dir/file>...
Compare yaml files in <dir> to running state in kubernetes and print the
differences. This is useful to ensure you have applied all your changes to the
appropriate environment. This tools runs kubectl, so unless your
~/.kube/config is configured for the correct environment, you will need to
supply the kubeconfig for the appropriate environment.
kubediff returns the status to stdout and non-zero exit code when difference is found. You can change this behavior using command line arguments.
You may also want to check the good article about validating YAML files:
Validating Kubernetes Deployment YAMLs

Terraform local-exec Provisioner to run on multiple Azure virtual machines

I had a working TF setup to spin up multiple Linux VMs in Azure. I was running a local-exec provisioner in a null_resource to execute an Ansible playbook. I was extracting the private IP addresses from the TF state file. The state file was stored locally.
I have recently configured Azure backend and now the state file is stored in a storage account.
I have modified the local provisioner and am trying to obtain all the private IP addresses to run the Ansible playbook against, as follows:
resource "null_resource" "Ansible4Ubuntu" {
provisioner "local-exec" {
command = "sleep 20;ansible-playbook -i '${element(azurerm_network_interface.unic.*.private_ip_address, count.index)}', vmlinux-playbook.yml"
I have also tried:
resource "null_resource" "Ansible4Ubuntu" {
provisioner "local-exec" {
command = "sleep 20;ansible-playbook -i '${azurerm_network_interface.unic.private_ip_address}', vmlinux-playbook.yml"
Both work fine with the first VM only and ignore the rest. I have also tried with count.index+1 and self.private_ip_address, but no luck.
Actual result: TF provides the private IP of only the first VM to Ansible.
Expected result: TF to provide a list of all private IPs to Ansible so that it can run the playbook against all of them.
PS: I am also looking at using the TF's remote_state data structure, but seems like the state file contains IPs from previous builds as well, making it hard to extract the ones good for the current build.
I would appreciate any help.
Thanks
Asghar
As Matt said, the null_resource runs only once, so it works with the first VM and ignores the rest. You need to configure triggers for the null_resource with the NIC list so that it is re-run when the list changes, and pass all the IPs to Ansible in a single command. Sample code:
resource "null_resource" "Ansible4Ubuntu" {
triggers = {
network_interface_ids = "${join(",", azurerm_network_interface.unic.*.id)}"
}
provisioner "local-exec" {
command = "sleep 20;ansible-playbook -i '${join(",", azurerm_network_interface.unic.*.private_ip_address)},' vmlinux-playbook.yml"
}
}
Adjust it as needed. For more information, see the null_resource documentation.
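For reference, the string handed to ansible-playbook -i has to be a comma-separated host list, and the trailing comma is what tells Ansible it is a list of hosts rather than a path to an inventory file. The helper below is a hypothetical sketch that builds such a string from a list of IPs; the name inline_inventory is an assumption.

```shell
#!/bin/sh
# inline_inventory HOST [HOST...]
# Prints the hosts as "h1,h2,...," including the trailing comma that marks
# the value as an inline host list for ansible-playbook -i.
inline_inventory() {
    for host in "$@"; do
        printf '%s,' "$host"
    done
    printf '\n'
}
```

For example, ansible-playbook -i "$(inline_inventory 10.0.0.4 10.0.0.5)" vmlinux-playbook.yml would target both addresses in one run.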

Terraform state environments/workspaces not listed when created on another machine

I've been using Terraform's state environments (soon to be renamed as workspaces) as part of a CI system (Gitlab CI) to spin up dynamic environments for each branch for tests to run against.
This seems to be working fine, but as part of the teardown of the environment after the branch is deleted I am also trying to use terraform env delete [ENVIRONMENT NAME]. When run locally this is fine, but my CI system is running in Docker and so has a clean workspace between creating the environment and then, in a later build stage, destroying it. In this case it can't seem to see the environment.
If I try to delete it I see this error:
Environment "restrict-dev-websites-internally" doesn't exist!
You can create this environment with the "new" option.
terraform env list also doesn't show the environment.
I've also noticed that I'm unable to select it despite seeing it in S3 (where my remote state is stored). If I create a new environment called the same thing then the environment from my remote state is used (it doesn't try to create another set of resources).
On top of this, when I'm using an environment created by the CI system I notice that sometimes I have an environment selected that terraform env list doesn't show:
$ terraform env list
default
$ cat .terraform/environment
[ENVIRONMENT NAME]
$ terraform env list
default
Note the missing * against the selected environment and that my environment isn't listed as would be expected by the example in the docs:
$ terraform env list
default
* development
mitchellh-test
I'm unsure as to how the state environments are meant to be working so I may have missed a trick here which is causing this odd corruption when working in Docker.
For completeness I'm managing the environments using some wrapper scripts:
env.sh
#!/bin/sh
set -e
if [ "$#" -ne 2 ]; then
echo "Usage: ./env.sh terraform_target env_name"
echo ""
echo "Example: ./env.sh test test-branch"
exit 1
fi
TERRAFORM_TARGET_LOCATION=${1}
TERRAFORM_ENV=${2}
REPO_BASE=`git rev-parse --show-toplevel`
TERRAFORM_BASE="${REPO_BASE}"/terraform
. "${TERRAFORM_BASE}"/remote.sh "${TERRAFORM_BASE}"/"${TERRAFORM_TARGET_LOCATION}"
if ! terraform env select ${TERRAFORM_ENV} 2> /dev/null; then
terraform env new ${TERRAFORM_ENV}
fi
env-delete.sh
#!/bin/sh
set -e
if [ "$#" -ne 2 ]; then
echo "Usage: ./env-delete.sh terraform_target env_name"
echo ""
echo "Example: ./env-delete.sh test test-branch"
exit 1
fi
TERRAFORM_TARGET_LOCATION=${1}
TERRAFORM_ENV=${2}
REPO_BASE=`git rev-parse --show-toplevel`
TERRAFORM_BASE="${REPO_BASE}"/terraform
. "${TERRAFORM_BASE}"/remote.sh "${TERRAFORM_BASE}"/"${TERRAFORM_TARGET_LOCATION}"
if terraform env select ${TERRAFORM_ENV} 2> /dev/null; then
terraform env select default
terraform env delete ${TERRAFORM_ENV}
fi
The remote.sh script runs a terraform init with dynamic state file locations depending on the project and path in the project using S3 as a backend.
remote.sh
#!/bin/sh
set -e
terraform --version
TERRAFORM_TARGET_LOCATION="${1}"
cd "${TERRAFORM_TARGET_LOCATION}"
REPO_NAME="$(basename "`git config --get remote.origin.url`" .git)"
STATE_BUCKET="<BUCKET_NAME>"
STATE_KEY="$(git rev-parse --show-prefix | cut -d"/" -f2-)"
STATE_FILE="terraform.tfstate"
terraform init -backend-config="bucket=${STATE_BUCKET}" \
-backend-config="key=${STATE_KEY}/${STATE_FILE}"
terraform get -update=true
When running things locally I have very wide permissions which include full access to all of S3. My Gitlab CI instances use the following IAM privileges attached to an instance profile:
{
"Version" : "2012-10-17",
"Statement": [
{
"Sid" : "1",
"Effect" : "Allow",
"Action" : [ "s3:List*",
"s3:Get*",
"s3:PutObject*" ],
"Resource": [ "arn:aws:s3:::<BUCKET_NAME>",
"arn:aws:s3:::<BUCKET_NAME>/*" ]
},
{
"Sid": "2",
"Effect": "Allow",
"Action": [
"s3:DeleteObject*"
],
"Resource": [
"arn:aws:s3:::<BUCKET_NAME>/env:*"
]
}
]
}
For clarity, my builds can see and use the remote state for the environment fine but are forced to create the environment over and over and are then unable to delete the environment after destroying everything in the state file because they can't select the environment.
I could always create the environment before deleting it so that I have it available in terraform env list but the point is that I'm not sure why the environment is not in the list when the environment was created on another machine or in another container.
You need the state to destroy the environment.
From the documentation as to why they need the state:
Terraform typically uses the configuration to determine dependency
order. However, when you delete a resource from a Terraform
configuration, Terraform must know how to delete that resource.
Terraform can see that a mapping exists for a resource not in your
configuration and plan to destroy. However, since the configuration no
longer exists, it no longer knows the proper destruction order.
You might also try to first import it which might be doable if there are not many instances involved.
I'd suggest considering running a consul container (or 3 to make it stable, they are really small) to store the state in another remote store than your default s3 store. This will make sure your CI environments will not show up in the remote store used by others. Consul has a web gui that will allow you to clean up the K/V stored there if it is ever needed. You can also interact with it through their api using curl or Ansible.
Alternatively, you can make the consul server part of the dev environment you set up, store the state there and read from it when destroying. In that case you would still keep everything else clean. I'd personally do it like this.
If a dev starts an environment on their local machine and you want to keep your remote state clean, they should be using the local state. You can also use the solution above for the dev's local machine and have a Consul server inside their local setup. That lets them create and destroy freely, and you keep your remote state clean, as you say.
As a disclaimer, I only started recently with Terraform but I don't quite see the advantage of the environments. I'm using a git repo with subdirs for each environment. That way they are truly independent from each other and I can set a dev locally and our staging/prod on our consul cluster protected with ACL.
Have you tried the volumes option in the runner configuration to save the Terraform state between builds?
https://gitlab.com/gitlab-org/gitlab-ci-multi-runner/blob/master/docs/configuration/advanced-configuration.md#the-runners-docker-section
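For the wrapper scripts in the question, it may also help to log explicitly whether the environment shows up in the list before selecting or deleting it, so the mismatch is visible in the CI output. The sketch below only parses text in the format shown in the question; the function name env_listed is an assumption.

```shell
#!/bin/sh
# env_listed NAME
# Reads `terraform env list` output on stdin and succeeds if NAME is listed.
# The leading "* " marker on the selected environment is ignored.
env_listed() {
    sed -e 's/^[* ]*//' -e 's/[[:space:]]*$//' | grep -Fqx -- "$1"
}
```

In env-delete.sh this could be used as terraform env list | env_listed "${TERRAFORM_ENV}" to decide whether the environment has to be re-created locally before it can be deleted.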

Resources