I am using Terraform to deploy an ECS EC2 cluster on AWS.
My pipeline is creating a new task definition from docker-compose, then updates the service to use this task definition.
Desired count is 1, deployment_minimum_healthy_percent is 100 and deployment_maximum_percent is 200.
Expected behavior is that the autoscaling group will launch a new EC2 instance to deploy the new task, then kill the old one.
What happens is that I get an error message : "service X was unable to place a task because no container instance met all of its requirements. The closest matching container-instance has insufficient memory available."
No instance is created and the deployment is rolled back. How can I make sure that an extra instance is created when deploying my service ?
Here is my Terraform code :
resource "aws_ecs_cluster" "main_cluster" {
name = var.application_name
tags = {
Environment = var.environment_name
}
}
data "aws_ecs_task_definition" "main_td" {
task_definition = var.application_name
}
resource "aws_ecs_service" "main_service" {
name = var.application_name
cluster = aws_ecs_cluster.main_cluster.id
launch_type = "EC2"
scheduling_strategy = "REPLICA"
task_definition = "${data.aws_ecs_task_definition.main_td.family}:${data.aws_ecs_task_definition.main_td.revision}"
desired_count = var.target_capacity
deployment_minimum_healthy_percent = 100
deployment_maximum_percent = 200
health_check_grace_period_seconds = 10
wait_for_steady_state = false
force_new_deployment = true
load_balancer {
target_group_arn = aws_lb_target_group.main_tg.arn
container_name = var.container_name
container_port = var.container_port
}
ordered_placement_strategy {
type = "binpack"
field = "memory"
}
deployment_circuit_breaker {
enable = true
rollback = true
}
lifecycle {
ignore_changes = [desired_count]
}
tags = {
Environment = var.environment_name
}
}
Auto-scaling group :
data "template_file" "user_data" {
template = "${file("${path.module}/user_data.sh")}"
vars = {
ecs_cluster = "${aws_ecs_cluster.main_cluster.name}"
}
}
resource "aws_launch_configuration" "main_lc" {
name = var.application_name
image_id = var.ami_id
instance_type = var.instance_type
associate_public_ip_address = true
iam_instance_profile = "arn:aws:iam::812844034365:instance-profile/ecsInstanceRole"
security_groups = ["${aws_security_group.main_sg.id}"]
root_block_device {
volume_size = "30"
volume_type = "gp3"
}
user_data = "${data.template_file.user_data.rendered}"
}
resource "aws_autoscaling_policy" "main_asg_policy" {
name = "${var.application_name}-cpu-scale-policy"
policy_type = "TargetTrackingScaling"
autoscaling_group_name = aws_autoscaling_group.main_asg.name
estimated_instance_warmup = 10
target_tracking_configuration {
predefined_metric_specification {
predefined_metric_type = "ASGAverageCPUUtilization"
}
target_value = 40.0
}
}
resource "aws_autoscaling_group" "main_asg" {
name = var.application_name
launch_configuration = aws_launch_configuration.main_lc.name
min_size = var.target_capacity
max_size = var.target_capacity * 2
health_check_type = "EC2"
health_check_grace_period = 10
default_cooldown = 30
desired_capacity = var.target_capacity
vpc_zone_identifier = data.aws_subnet_ids.subnets.ids
wait_for_capacity_timeout = "3m"
instance_refresh {
strategy = "Rolling"
preferences {
min_healthy_percentage = 100
}
}
}
Module is published here https://registry.terraform.io/modules/hboisgibault/ecs-cluster/aws/latest
Related
I am using the following terraform to create windows EC2 instance. The instance is launched successfully, but in AWS console I see hyphenated name for EC2 instance.
For brevity purpose I have removed some TF code
resource "aws_launch_template" "server_launch_template" {
name = "my-launch-template"
image_id = "my-windows-ami-id"
instance_type = "t3.medium"
key_name = "my-keypair"
vpc_security_group_ids = [var.security_group_id]
iam_instance_profile {
arn = aws_iam_instance_profile.my_instance.arn
}
tag_specifications {
resource_type = "instance"
tags = module.tags.mytags
}
lifecycle {
create_before_destroy = true
}
}
resource "aws_autoscaling_group" "server_autoscaling_group" {
name = "my autoscaling group"
max_size = 1
min_size = 1
desired_capacity = 1
vpc_zone_identifier = [var.subnet_id]
wait_for_capacity_timeout = var.wait_for_capacity
health_check_type = "EC2"
dynamic "tag" {
#some code here
}
launch_template {
id = aws_launch_template.server_launch_template.id
version = "$Latest"
}
lifecycle {
create_before_destroy = true
}
}
How and where do I specify instance name in the launch template?
You can't define dynamic names for instances launched by an autoscaling group.
You can however configure a lambda function to run whenever the autoscaling launches new instances, and you can name the instances from the lambda.
This works. As #MarkoE suggested added Name tag in tag_specifications
resource "aws_launch_template" "server_launch_template" {
name = "my-launch-template"
image_id = "my-windows-ami-id"
instance_type = "t3.medium"
key_name = "my-keypair"
vpc_security_group_ids = [var.security_group_id]
iam_instance_profile {
arn = aws_iam_instance_profile.my_instance.arn
}
tag_specifications {
resource_type = "instance"
tags = merge(module.tags.mytags, { Name = "my-runner-instance" })
}
lifecycle {
create_before_destroy = true
}
}
I am data bricks cluster using terraform with below code.
resource "azurerm_resource_group" "myresourcegroup" {
name = "${var.applicationName}-${var.environment}-rg"
location = var.location
tags = {
environment = var.environment
}
}
resource "azurerm_databricks_workspace" "dbworkspace" {
name = "${var.applicationName}-${var.environment}-workspace"
resource_group_name = "${var.applicationName}-${var.environment}-rg"
location = var.location
sku = var.databricks_sku
custom_parameters {
no_public_ip = "true"
virtual_network_id = azurerm_virtual_network.vnet.id
public_subnet_name = azurerm_subnet.public_subnet.name
private_subnet_name = azurerm_subnet.private_subnet.name
public_subnet_network_security_group_association_id = azurerm_subnet.public_subnet.id
private_subnet_network_security_group_association_id = azurerm_subnet.private_subnet.id
}
depends_on = [azurerm_resource_group.myresourcegroup, azurerm_network_security_group.nsg, azurerm_virtual_network.vnet, azurerm_subnet.public_subnet, azurerm_subnet.private_subnet, azurerm_subnet_network_security_group_association.public-sn-nsg-assoc, azurerm_subnet_network_security_group_association.private-sn-nsg-assoc]
}
# Databricks Cluster
resource "databricks_cluster" "dbcluster" {
cluster_name = "${var.applicationName}-${var.environment}-cluster"
spark_version = "10.4.x-scala2.12"
node_type_id = "Standard_DS3_v2"
autotermination_minutes = 10
enable_local_disk_encryption = true
is_pinned = "true"
autoscale {
min_workers = 1
max_workers = 8
}
# spark_conf = {
# "spark.databricks.delta.optimizeWrite.enabled": true,
# "spark.databricks.delta.autoCompact.enabled": true,
# "spark.databricks.delta.preview.enabled": true,
# }
depends_on = [azurerm_resource_group.myresourcegroup, azurerm_network_security_group.nsg, azurerm_virtual_network.vnet, azurerm_subnet.public_subnet, azurerm_subnet.private_subnet, azurerm_subnet_network_security_group_association.public-sn-nsg-assoc, azurerm_subnet_network_security_group_association.private-sn-nsg-assoc, azurerm_databricks_workspace.dbworkspace]
}
My resource group and databricks workspace are creating fine but data bricks cluster is not getting created. When I see plan and apply I can see it is creating. I don't know what I am missing.
resource "aws_eks_node_group" "n-cluster-group" {
cluster_name = aws_eks_cluster.n-cluster.name
node_group_name = "n-cluster-group"
node_role_arn = aws_iam_role.eks-nodegroup.arn
subnet_ids = [aws_subnet.public.id, aws_subnet.public2.id]
scaling_config {
desired_size = 3
max_size = 6
min_size = 1
}
launch_template {
id = aws_launch_template.n-cluster.id
version = aws_launch_template.n-cluster.latest_version
}
depends_on = [
aws_iam_role_policy_attachment.AmazonEKSWorkerNodePolicy,
aws_iam_role_policy_attachment.AmazonEC2ContainerRegistryReadOnly,
aws_iam_role_policy_attachment.AmazonEKS_CNI_Policy,
]
resource "aws_launch_template" "n-cluster" {
image_id = "ami-0d45236a5972906dd"
instance_type = "t3.medium"
name_prefix = "cluster-node-"
block_device_mappings {
device_name = "/dev/sda1"
ebs {
volume_size = 20
}
}
Although instances appear to successfully createthe node group status is CREATE_FAILED terraform reports this as well.
I am wondering what CREATE_FAILED means
what am I dooing wrong? when using a launch group and an eks optomized AMI should I still specify user_data and if so what is the correct way to do this using terraform.
I managed to solve the issue with the following configurations:
resource "aws_launch_template" "eks_launch_template" {
name = "eks_launch_template"
block_device_mappings {
device_name = "/dev/xvda"
ebs {
volume_size = 20
volume_type = "gp2"
}
}
image_id = <custom_ami_id>
instance_type = "t3.medium"
user_data = filebase64("${path.module}/eks-user-data.sh")
tag_specifications {
resource_type = "instance"
tags = {
Name = "EKS-MANAGED-NODE"
}
}
}
resource "aws_eks_node_group" "eks-cluster-ng" {
cluster_name = aws_eks_cluster.eks-cluster.name
node_group_name = "eks-cluster-ng-"
node_role_arn = aws_iam_role.eks-cluster-ng.arn
subnet_ids = [var.network_subnets.pvt[0].id, var.network_subnets.pvt[1].id, var.network_subnets.pvt[2].id]
scaling_config {
desired_size = var.asg_desired_size
max_size = var.asg_max_size
min_size = var.asg_min_size
}
launch_template {
name = aws_launch_template.eks_launch_template.name
version = aws_launch_template.eks_launch_template.latest_version
}
depends_on = [
aws_iam_role_policy_attachment.AmazonEKSWorkerNodePolicy,
aws_iam_role_policy_attachment.AmazonEC2ContainerRegistryReadOnly,
aws_iam_role_policy_attachment.AmazonEKS_CNI_Policy,
]
}
The key lies with user_data = filebase64("${path.module}/eks-user-data.sh")
The eks-user-data.sh file should be something like this:
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="==MYBOUNDARY=="
--==MYBOUNDARY==
Content-Type: text/x-shellscript; charset="us-ascii"
#!/bin/bash
/etc/eks/bootstrap.sh <cluster-name>
--==MYBOUNDARY==--\
I have tested the above and it works as intended. Thanks all for leading me to this solution
Adding this to your launch template definition resolves it:
user_data = base64encode(<<-EOF
#!/bin/bash -xe
/etc/eks/bootstrap.sh CLUSTER_NAME_HERE
EOF
)
I guess even a EKS optimised AMI counts as a custom AMI if used via launch template.
I tried to build an ECS cluster with ALB in front using terraform. As I used dynamic port mappping the targets will not be registerd as healthy. I played with the healthcheck and Success codes if I set it to 301 everything is fine.
ECS
data "template_file" "mb_task_template" {
template = file("${path.module}/templates/marketplace-backend.json.tpl")
vars = {
name = "${var.mb_image_name}"
port = "${var.mb_port}"
image = "${aws_ecr_repository.mb.repository_url}"
log_group = "${aws_cloudwatch_log_group.mb.name}"
region = "${var.region}"
}
}
resource "aws_ecs_cluster" "mb" {
name = var.mb_image_name
}
resource "aws_ecs_task_definition" "mb" {
family = var.mb_image_name
container_definitions = data.template_file.mb_task_template.rendered
volume {
name = "mb-home"
host_path = "/ecs/mb-home"
}
}
resource "aws_ecs_service" "mb" {
name = var.mb_repository_url
cluster = aws_ecs_cluster.mb.id
task_definition = aws_ecs_task_definition.mb.arn
desired_count = 2
iam_role = var.aws_iam_role_ecs
depends_on = [aws_autoscaling_group.mb]
load_balancer {
target_group_arn = var.target_group_arn
container_name = var.mb_image_name
container_port = var.mb_port
}
}
resource "aws_autoscaling_group" "mb" {
name = var.mb_image_name
availability_zones = ["${var.availability_zone}"]
min_size = var.min_instance_size
max_size = var.max_instance_size
desired_capacity = var.desired_instance_capacity
health_check_type = "EC2"
health_check_grace_period = 300
launch_configuration = aws_launch_configuration.mb.name
vpc_zone_identifier = flatten([var.vpc_zone_identifier])
lifecycle {
create_before_destroy = true
}
}
data "template_file" "user_data" {
template = file("${path.module}/templates/user_data.tpl")
vars = {
ecs_cluster_name = "${var.mb_image_name}"
}
}
resource "aws_launch_configuration" "mb" {
name_prefix = var.mb_image_name
image_id = var.ami
instance_type = var.instance_type
security_groups = ["${var.aws_security_group}"]
iam_instance_profile = var.aws_iam_instance_profile
key_name = var.key_name
associate_public_ip_address = true
user_data = data.template_file.user_data.rendered
lifecycle {
create_before_destroy = true
}
}
resource "aws_cloudwatch_log_group" "mb" {
name = var.mb_image_name
retention_in_days = 14
}
ALB
locals {
target_groups = ["1", "2"]
}
resource "aws_alb" "mb" {
name = "${var.mb_image_name}-alb"
internal = false
load_balancer_type = "application"
security_groups = ["${aws_security_group.mb_alb.id}"]
subnets = var.subnets
tags = {
Name = var.mb_image_name
}
}
resource "aws_alb_target_group" "mb" {
count = length(local.target_groups)
name = "${var.mb_image_name}-tg-${element(local.target_groups, count.index)}"
port = var.mb_port
protocol = "HTTP"
vpc_id = var.vpc_id
target_type = "instance"
health_check {
path = "/health"
protocol = "HTTP"
timeout = "10"
interval = "15"
healthy_threshold = "3"
unhealthy_threshold = "3"
matcher = "200-299"
}
lifecycle {
create_before_destroy = true
}
tags = {
Name = var.mb_image_name
}
}
resource "aws_alb_listener" "mb_https" {
load_balancer_arn = aws_alb.mb.arn
port = 443
protocol = "HTTPS"
ssl_policy = "ELBSecurityPolicy-2016-08"
certificate_arn = module.dns.certificate_arn
default_action {
type = "forward"
target_group_arn = aws_alb_target_group.mb.0.arn
}
}
resource "aws_alb_listener_rule" "mb_https" {
listener_arn = aws_alb_listener.mb_https.arn
priority = 100
action {
type = "forward"
target_group_arn = aws_alb_target_group.mb.0.arn
}
condition {
field = "path-pattern"
values = ["/health/"]
}
}
Okay. Looks like the code above is working. I had different issue with networking.
When I update desired_count, the terraform planner shows that the operation will be an update in-place. However, when terraform tries to apply the changes, I get the following error:
Terraform v0.12.21
Initializing plugins and modules...
2020/03/05 22:10:52 [DEBUG] Using modified User-Agent: Terraform/0.12.21 TFC/8f5a579db5
module.web.aws_ecs_service.web[0]: Modifying... [id=arn:aws:ecs:us-east-1:55555:service/web/web]
Error: Error updating ECS Service (arn:aws:ecs:us-east-1:55555:service/web/web): InvalidParameterException: Unable to update network parameters on services with a CODE_DEPLOY deployment controller. Use AWS CodeDeploy to trigger a new deployment.
The terraform code used to reproduce this looks something like:
resource "aws_lb" "platform" {
name = "platform"
internal = false
load_balancer_type = "application"
ip_address_type = "ipv4"
security_groups = [aws_security_group.lb.id]
subnets = [for subnet in aws_subnet.lb : subnet.id]
enable_deletion_protection = true
tags = {
Name = "platform"
Type = "Public"
}
}
resource "aws_lb_target_group" "platform" {
count = 2
name = "platform-tg-${count.index + 1}"
vpc_id = var.vpc_id
protocol = "HTTP"
port = 80
target_type = "ip"
stickiness {
type = "lb_cookie"
enabled = false
}
health_check {
path = "/healthcheck"
port = var.container_port
protocol = "HTTP"
timeout = 5
healthy_threshold = 5
unhealthy_threshold = 3
matcher = "200"
}
tags = {
Name = "platform-tg-${count.index + 1}"
Type = "Public"
}
}
resource "aws_lb_listener" "platform-https" {
load_balancer_arn = aws_lb.platform.arn
port = 443
protocol = "HTTPS"
ssl_policy = "ELBSecurityPolicy-TLS-1-2-Ext-2018-06"
certificate_arn = var.certificate_arn
depends_on = [aws_lb_target_group.platform]
default_action {
type = "forward"
target_group_arn = aws_lb_target_group.platform[0].arn
}
lifecycle {
ignore_changes = [
default_action
]
}
}
locals {
family = "platform"
container_name = "web"
}
resource "aws_cloudwatch_log_group" "platform" {
name = "/aws/ecs/platform"
retention_in_days = 3653
tags = {
Name = "platform"
}
}
resource "aws_ecs_task_definition" "platform" {
family = local.family
requires_compatibilities = ["FARGATE"]
cpu = var.service.cpu
memory = var.service.memory
network_mode = "awsvpc"
execution_role_arn = aws_iam_role.ecs_task_execution.arn
task_role_arn = aws_iam_role.ecs_task_execution.arn
container_definitions = jsonencode(
jsondecode(
templatefile("${path.module}/taskdef.json", {
family = local.family
container_name = local.container_name
region = var.region
account_id = var.account_id
cpu = var.service.cpu
memory = var.service.memory
image = var.service.container_image
log_group = aws_cloudwatch_log_group.platform.name
node_env = var.node_env
port = var.container_port
platform_url = var.platform_url
short_url = var.short_url
cdn_url = var.cdn_url
})
).containerDefinitions
)
tags = {
Name = "platform"
Type = "Private"
}
}
resource "aws_ecs_cluster" "platform" {
name = "platform"
setting {
name = "containerInsights"
value = "enabled"
}
tags = {
Name = "platform"
Type = "Public"
}
}
data "aws_lb_listener" "current-platform" {
arn = aws_lb_listener.platform-https.arn
}
data "aws_ecs_task_definition" "current-platform" {
task_definition = local.family
}
resource "aws_ecs_service" "platform" {
count = var.delete_platform_ecs_service ? 0 : 1
name = "platform"
cluster = aws_ecs_cluster.platform.arn
launch_type = "FARGATE"
desired_count = var.service.container_count
enable_ecs_managed_tags = true
task_definition = "${aws_ecs_task_definition.platform.family}:${max(aws_ecs_task_definition.platform.revision, data.aws_ecs_task_definition.current-platform.revision)}"
depends_on = [aws_lb_target_group.platform]
load_balancer {
target_group_arn = data.aws_lb_listener.current-platform.default_action[0].target_group_arn
container_name = local.container_name
container_port = var.container_port
}
network_configuration {
subnets = sort([for subnet in aws_subnet.ecs : subnet.id])
security_groups = [aws_security_group.ecs.id]
}
deployment_controller {
type = "CODE_DEPLOY"
}
lifecycle {
// NOTE: Based on: https://docs.aws.amazon.com/cli/latest/reference/ecs/update-service.html
// If the network configuration, platform version, or task definition need to be updated, a new AWS CodeDeploy deployment should be created.
ignore_changes = [
load_balancer,
network_configuration,
task_definition
]
}
tags = {
Name = "platform"
Type = "Private"
}
}
This is using Terraform v0.12.21. Full debug output is available at: https://gist.github.com/jgeurts/f4d930608a119e9cd75a7a54b111ee7c
This is maybe not the best answer, but I wasn't able to get terraform to adjust only the desired_count. Instead, I added auto scaling to the ECS service:
Ignore desired_count:
resource "aws_ecs_service" "platform" {
...
lifecycle {
// NOTE: Based on: https://docs.aws.amazon.com/cli/latest/reference/ecs/update-service.html
// If the network configuration, platform version, or task definition need to be updated, a new AWS CodeDeploy deployment should be created.
ignore_changes = [
desired_count, # Preserve desired count when updating an autoscaled ECS Service
load_balancer,
network_configuration,
task_definition,
]
}
}
Add auto-scaling:
resource "aws_appautoscaling_target" "platform" {
max_capacity = var.max_capacity
min_capacity = var.min_capacity
resource_id = "service/${aws_ecs_cluster.platform.name}/${aws_ecs_cluster.platform.name}"
scalable_dimension = "ecs:service:DesiredCount"
service_namespace = "ecs"
depends_on = [
aws_ecs_cluster.platform,
]
}
resource "aws_appautoscaling_policy" "platform" {
name = "platform-auto-scale"
service_namespace = aws_appautoscaling_target.platform.service_namespace
resource_id = aws_appautoscaling_target.platform.resource_id
scalable_dimension = aws_appautoscaling_target.platform.scalable_dimension
policy_type = "TargetTrackingScaling"
target_tracking_scaling_policy_configuration {
target_value = var.service.autoscale_target_cpu_percentage
scale_out_cooldown = 60
scale_in_cooldown = 300
predefined_metric_specification {
predefined_metric_type = "ECSServiceAverageCPUUtilization"
}
}
}
resource "aws_appautoscaling_scheduled_action" "platform_0430_increase_min_capacity" {
name = "platform-0430-increase-min-capacity"
schedule = "cron(30 4 * * ? *)"
service_namespace = aws_appautoscaling_target.platform.service_namespace
resource_id = aws_appautoscaling_target.platform.resource_id
scalable_dimension = aws_appautoscaling_target.platform.scalable_dimension
scalable_target_action {
min_capacity = var.min_capacity + 4
max_capacity = var.max_capacity
}
}
resource "aws_appautoscaling_scheduled_action" "platform_0615_restore_min_capacity" {
name = "platform-0615-restore-min-capacity"
schedule = "cron(15 06 * * ? *)"
service_namespace = aws_appautoscaling_target.platform.service_namespace
resource_id = aws_appautoscaling_target.platform.resource_id
scalable_dimension = aws_appautoscaling_target.platform.scalable_dimension
scalable_target_action {
min_capacity = var.min_capacity
max_capacity = var.max_capacity
}
}
resource "aws_appautoscaling_scheduled_action" "platform_weekday_0945_increase_min_capacity" {
name = "platform-weekday-0945-increase-min-capacity"
schedule = "cron(45 9 ? * MON-FRI *)"
service_namespace = aws_appautoscaling_target.platform.service_namespace
resource_id = aws_appautoscaling_target.platform.resource_id
scalable_dimension = aws_appautoscaling_target.platform.scalable_dimension
scalable_target_action {
min_capacity = var.min_capacity + 4
max_capacity = var.max_capacity
}
}
resource "aws_appautoscaling_scheduled_action" "platform_weekday_2100_restore_min_capacity" {
name = "platform-weekday-2100-restore-min-capacity"
schedule = "cron(0 2100 ? * MON-FRI *)"
service_namespace = aws_appautoscaling_target.platform.service_namespace
resource_id = aws_appautoscaling_target.platform.resource_id
scalable_dimension = aws_appautoscaling_target.platform.scalable_dimension
scalable_target_action {
min_capacity = var.min_capacity
max_capacity = var.max_capacity
}
}