How to replace an instance created by Terraform

I created multiple nodes using Terraform and then deployed these nodes as a cluster using Ansible.
resource "google_compute_instance" "cluster" {
  count        = 6
  machine_type = "e2.micro"
  ...
}
Now suppose one of the nodes develops a problem, such as a hardware issue, so I have to destroy it, launch a replacement, and deploy it again with Ansible.
How can I destroy it and then launch a new one with that same Terraform code? With the configuration above, I only know how to add a new node by changing the count to 7.
Also, is there any way to change the machine type of just one of the nodes? The use case: occasionally one node runs out of memory, so I want to move it to a larger machine type (perhaps temporarily).
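For reference, a minimal sketch of both operations (the node index and the map contents below are hypothetical, and other arguments are elided as in the question): Terraform 0.15.2+ can force recreation of a single counted instance with -replace, and switching from count to for_each lets each node carry its own machine type. Note that moving existing count-based instances to for_each requires terraform state mv so they are not destroyed and recreated.
# Recreate only the failed node (index 2 is hypothetical):
#   terraform apply -replace='google_compute_instance.cluster[2]'

variable "cluster_nodes" {
  type = map(string)
  default = {
    node0 = "e2.micro"
    node1 = "e2.micro"
    node2 = "e2.medium" # temporarily larger for the memory-hungry node
  }
}

resource "google_compute_instance" "cluster" {
  for_each     = var.cluster_nodes
  machine_type = each.value
  ...
}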

You can also create the AMI using Packer (another HashiCorp tool), put that AMI into a launch configuration, and then put that launch configuration into an Auto Scaling group (all of this done in Terraform, of course). That way you can simply update the AMI value in the launch configuration whenever you want to roll out a new image.
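A minimal sketch of that pattern (the resource and variable names are hypothetical; AWS now recommends launch templates over launch configurations, and Terraform supports them the same way):
variable "cluster_ami" {
  type        = string
  description = "AMI ID produced by the Packer build"
}

resource "aws_launch_template" "cluster" {
  name_prefix   = "cluster-"
  image_id      = var.cluster_ami # bump this when Packer builds a new image
  instance_type = "t3.micro"
}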

You're looking for AWS Auto Scaling groups (ASG): https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/autoscaling_group
Set the desired capacity to 6 and define a launch template with the t3.micro instance type.
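A sketch of the ASG side, reusing the hypothetical launch template above (the subnet variable is also hypothetical); once instances are in an ASG, terminating an unhealthy one, manually or via health checks, makes the group launch a replacement automatically:
resource "aws_autoscaling_group" "cluster" {
  desired_capacity    = 6
  min_size            = 6
  max_size            = 6
  vpc_zone_identifier = var.subnet_ids # hypothetical list of subnets

  launch_template {
    id      = aws_launch_template.cluster.id
    version = "$Latest"
  }
}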

Related

Kafka Connect consumer group ID doesn't get set in Fargate cluster

I'm using the Debezium PostgreSQL connector to send my PostgreSQL data to Kafka. I set all the configs correctly and it works as expected in the local environment with docker-compose. Then we used Terraform to automate deployment to the AWS Fargate cluster. The Terraform scripts also worked fine and launched all the required infrastructure. Then comes the problem:
The connector doesn't start in Fargate, and the logs show GROUP_ID=1. (This is set correctly locally with docker-compose: GROUP_ID=connect-group-dev.)
I provide GROUP_ID as connect-group-dev in the environment variables, but that is not reflected in the Fargate container; however, in the AWS console I can see that GROUP_ID is set to connect-group-dev.
All other environment variables are reflected in the container.
I suspect that GROUP_ID is not available to the container when it starts the Kafka connector, but is only set on the container at a later step (because I can see the correct value in the task definition in the AWS console).
Is 1 the default value for GROUP_ID? (I don't set any variable to 1.)
This is a weird situation; I've double-checked all the files but still cannot find a reason for it. Any help would be great.
I'd recommend you use MSK Connect rather than Fargate, but assuming you are using the Debezium Docker container, then yes, GROUP_ID=1 is the default.
If you are not using the Debezium container, then that would explain why the variable is not set at runtime.
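For reference, a minimal sketch of how the variable would be passed through a Terraform task definition, assuming the Debezium Connect image (the image tag, broker address, and sizes are hypothetical; IAM roles and networking are elided):
resource "aws_ecs_task_definition" "connect" {
  family                   = "kafka-connect"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = "1024"
  memory                   = "2048"

  container_definitions = jsonencode([{
    name  = "connect"
    image = "debezium/connect:2.1" # this image reads GROUP_ID at startup
    environment = [
      { name = "GROUP_ID", value = "connect-group-dev" },
      { name = "BOOTSTRAP_SERVERS", value = "broker:9092" } # hypothetical broker
    ]
  }])
}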

How to scale up a Kubernetes cluster with Terraform while avoiding downtime?

Here's the scenario: we have some applications running on a Kubernetes cluster on Azure. Currently our production cluster has one node pool with 3 nodes, which are fairly low on resources because we still don't have that many simultaneous active users/requests.
Our backend API app runs on three pods, one on each node. I was told I will need to increase resources soon (I'm thinking more memory, or even replacing the nodes' VMs with better ones).
We structured everything Kubernetes-related using Terraform, and I know that replacing the VMs in a node pool is a destructive action, meaning the cluster will have to be replaced and the new config plus all deployments, services, etc. will have to be reapplied.
I am fairly new to the Kubernetes and Terraform world: I can do the basics to get an application up and running, but I would like to learn the best practices for scaling and performance. How can I perform such an increase in resources without any downtime for our services?
I'm wondering if having an extra node pool would help while I replace the VMs of the other one (I might be absolutely wrong here).
If there's any link, course, or tutorial you can point me to, it's highly appreciated.
(Moved from comments)
In Azure, when you're performing a cluster upgrade, there's a parameter called "max surge count", which is 1 by default. What it means is that when you update your cluster or node configuration, Azure will first create one extra node with the updated configuration, and only then safely drain and remove one of the old ones. More on this here: Azure - Node Surge Upgrade
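In Terraform that setting lives on the node pool; a minimal sketch with the azurerm provider (the pool name, VM size, and cluster reference are hypothetical):
resource "azurerm_kubernetes_cluster_node_pool" "apps" {
  name                  = "apps"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.main.id
  vm_size               = "Standard_D4s_v3"
  node_count            = 3

  upgrade_settings {
    max_surge = "1" # one extra node is created before an old node is drained and removed
  }
}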

How to change the node name in the node pool of a GKE cluster in Terraform?

Is there any way to change the node names in the node pool of a GKE cluster that was provisioned with Terraform?
Currently the format is the following:
gke-proj-k8s-qa-proj-k8s-qa-n-998c055f-g74g
I would like to change it to something more meaningful like:
proj-k8s-qa-pool-a-1
Thank you.
As I recall, if you want to change a node's name in Kubernetes, you actually need to remove the node, change the manifest, and rejoin it to the cluster. This is not possible with GKE.
I'd suggest adding a custom label (e.g. short_name) to each node instead.
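A minimal sketch of such a label set through Terraform (the pool name, machine type, and cluster reference are hypothetical); the label then works with selectors such as kubectl get nodes -l short_name=pool-a:
resource "google_container_node_pool" "pool_a" {
  name    = "proj-k8s-qa-pool-a"
  cluster = google_container_cluster.primary.id

  node_config {
    machine_type = "e2-standard-2"
    labels = {
      short_name = "pool-a" # applied as a Kubernetes label on every node in the pool
    }
  }
}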

Is it possible to delete and re-create a GKE nodepool with a new machine type that is managed through Terraform?

I want to change the machine type on my GKE nodepool to better match my CPU vs. memory usage, but I am having a lot of trouble getting Terraform to delete this nodepool and re-create it. I am locked into having a default nodepool because of the module that was used in the past to create the cluster. The cluster is in a shared module with the nodepool, so I cannot permanently delete the nodepool through Terraform without also deleting the cluster, which would affect everybody expecting the cluster to stay available.
So my solution was to create an additional, temporary nodepool, migrate all pods to it, cordon and drain the default nodepool, and then, through Terraform, change the nodepool's machine type so that it would be recreated without affecting any running deployments, pods, etc. However, Terraform did not attempt to delete the nodepool, only to re-create it; therefore it failed with a 409, "nodepool already exists".
My question is: can I delete a nodepool manually (through gcloud commands or other such methods), then re-run Terraform and hopefully not hit the 409 (nodepool already exists) error? Are there any consequences for the Terraform state file? Would Terraform fail completely if I deleted a resource (the nodepool) that it was expecting to exist?
Note: I did my best to include all the information, but if more is needed please let me know and I will edit this and add it. Thanks.
I have found an example of how to migrate workloads to different machine types.
Basically you did the correct actions:
Create a new nodepool with the larger machine type in parallel.
Cordon the nodes in the nodepool you want to delete.
Drain those nodes so the workloads migrate to the new nodepool.
Then delete the old nodepool.
Which is essentially what you already did:
"My solution was to create an additional and temporary nodepool, migrate all pods to it, cordon and drain the default nodepool."
But if you want to re-apply your Terraform after doing manual operations, you are probably going to get a lot of errors, since Terraform keeps a record of the state. What I think you would have to do is perform all the steps above through Terraform, so that it does not lose consistency with its state; the sketch after this answer shows the idea. Especially since the machine-type change is not a transparent operation for GKE, the documentation recommends creating a new nodepool.
Take into consideration that Terraform will not let you manage the workloads in your cluster (cordon, drain, etc.);
that must be done by hand.
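A minimal sketch of the Terraform-driven version of those steps (the resource names, pool names, and machine types are hypothetical):
# Existing pool, already in state:
resource "google_container_node_pool" "default_pool" {
  name       = "default-pool"
  cluster    = google_container_cluster.primary.id
  node_count = 3

  node_config {
    machine_type = "n1-standard-4"
  }
}

# Step 1: add the replacement pool and terraform apply; both pools now exist.
resource "google_container_node_pool" "highmem_pool" {
  name       = "highmem-pool"
  cluster    = google_container_cluster.primary.id
  node_count = 3

  node_config {
    machine_type = "n2-highmem-4"
  }
}

# Step 2: cordon and drain the old pool's nodes by hand (kubectl), then delete
# the "default_pool" block above and terraform apply again; Terraform removes
# only that pool, and the state stays consistent throughout.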
If your issue is the default Terraform behaviour of destroying resources before creating their replacements, you could try the lifecycle meta-argument in your Terraform configuration.
resource "google_container_node_pool" "example" {
  # ...

  lifecycle {
    create_before_destroy = true
  }
}
This ensures the nodepool you want to replace stays up until the replacement pool with the new machine type has been created. Note that the replacement needs a name that can change (e.g. use name_prefix instead of a fixed name); otherwise creating it while the old pool still exists will collide, much like the 409 you saw.

Resizing the Managed AKS cluster node size or type destroys and recreates the cluster with Terraform

I am trying to resize the AKS cluster nodes with Terraform, but every time I do this, Terraform destroys and recreates the cluster. I am able to resize the cluster from the Portal without disturbing the environment.
Is this expected?
Yes, this is expected: the AKS node size is immutable (at least it was last time I checked). If it's not anymore, it means Terraform is relying on outdated SDK calls.
BTW, I can't see how you resize it; scaling works, resizing doesn't. Scaling should happen without recreating.
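A sketch of where the two knobs live in the azurerm provider (the names and sizes are hypothetical): changing vm_size on the default node pool has historically forced a full recreate, while changing node_count is an in-place scaling operation. (Newer azurerm releases can instead rotate the default pool via temporary_name_for_rotation.)
resource "azurerm_kubernetes_cluster" "main" {
  name                = "aks-main"
  location            = "westeurope"
  resource_group_name = "rg-aks"
  dns_prefix          = "aksmain"

  default_node_pool {
    name       = "default"
    vm_size    = "Standard_D2s_v3" # resize: forces replacement
    node_count = 3                 # scale: updated in place
  }

  identity {
    type = "SystemAssigned"
  }
}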
