My ECS service is failing after the task starts


I created my ECS task definition like this:

{
    "containerDefinitions": [
        {
            "name": "api",
            "image": "414397229292.dkr.ecr.ap-south-1.amazonaws.com/bids365/backend:79cb8aa",
            "cpu": 256,
            "memoryReservation": 512,
            "portMappings": [
                {
                    "containerPort": 3000,
                    "hostPort": 3000,
                    "protocol": "tcp"
                }
            ],
            "essential": true,
            "environment": [
                {
                    "name": "IMPART_ENV",
                    "value": "dev"
                }
            ],
            "mountPoints": [],
            "volumesFrom": [],
            "disableNetworking": false,
            "privileged": false,
            "readonlyRootFilesystem": true,
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": "bids365/backend/dev",
                    "awslogs-region": "ap-south-1",
                    "awslogs-stream-prefix": "dev"
                }
            },
            "healthCheck": {
                "command": [
                    "CMD-SHELL",
                    "curl -f http://localhost:3000/ping || exit 1"
                ],
                "interval": 60,
                "timeout": 2,
                "retries": 3,
                "startPeriod": 60
            }
        }
    ],
    "family": "service-dev",
    "taskRoleArn": "arn:aws:iam::414397229292:role/task-role-backend-dev",
    "executionRoleArn": "arn:aws:iam::414397229292:role/ecs-task-execution-role",
    "networkMode": "awsvpc",
    "volumes": [],
    "placementConstraints": [],
    "requiresCompatibilities": [
        "FARGATE"
    ],
    "cpu": "256",
    "memory": "512"
}

My ALB is created like this using Terraform:

resource "aws_lb" "backend" {
  name               = "backend-alb"
  internal           = false
  load_balancer_type = "application"
  security_groups    = [module.sg.id]
  subnets            = ["subnet-082af865c1410b0ef", "subnet-0da703055394aa446", "subnet-02d9bc7c78c939446"]

  enable_deletion_protection = false // todo - change this when we get more clarity

  tags = {
    Environment = "production"
  }

  lifecycle {
    prevent_destroy = false
  }
}

module "sg" {
  source    = "cloudposse/security-group/aws"
  version   = "0.1.3"
  vpc_id    = data.aws_vpc.bid365-backend.id
  delimiter = ""
  name      = "443-ingress-private-egress"

  rules = [
    {
      type      = "egress"
      from_port = 0
      to_port   = 65535
      protocol  = "TCP"
      # cidr_blocks = [data.aws_vpc.bid365-backend.cidr_block]
      self = true
    },
    {
      type      = "ingress"
      from_port = 443
      to_port   = 443
      protocol  = "TCP"
      # cidr_blocks = ["0.0.0.0/0"]
      self = true
    },
    {
      type      = "ingress"
      from_port = 80
      to_port   = 80
      protocol  = "TCP"
      # cidr_blocks = ["0.0.0.0/0"]
      self = true
    },
    {
      type      = "ingress"
      from_port = 3000
      to_port   = 3000
      protocol  = "TCP"
      # cidr_blocks = ["0.0.0.0/0"]
      self = true
    }
  ]
}

resource "aws_lb_listener" "redirect_non_ssl" { load_balancer_arn = aws_lb.backend.arn port = "80" protocol = "HTTP"

default_action { type = "redirect"

redirect {
  port        = "443"
  protocol    = "HTTPS"
  status_code = "HTTP_301"
}

} }

resource "aws_acm_certificate" "cert" {

domain_name = var.app_dns_entry

validation_method = "DNS"

lifecycle {

prevent_destroy = false

}

}

resource "aws_lb_listener" "app" { load_balancer_arn = aws_lb.backend.arn port = "443" protocol = "HTTPS" ssl_policy = "ELBSecurityPolicy-2016-08" certificate_arn = "arn:aws:acm:ap-south-1:414397229292:certificate/4290c5e1-4b49-40bf-afb5-bedeefd072c2"

default_action { type = "fixed-response" fixed_response { content_type = "text/plain" message_body = "Not Found\n" status_code = "404" } }

lifecycle { prevent_destroy = false } }

`

Everything seems correct, and the task and ALB are all created, but the task is failing the health check in the target group. The container port is mapped correctly, and the /ping endpoint I created for the health check also works correctly if I access it from inside the container. Please help, I have been stuck on this for a long time and have tried almost everything.

4 Answers

Hello.
Is the path used for ALB health checks correct?
ALB performs health checks on "/" by default.
https://docs.aws.amazon.com/elasticloadbalancing/latest/application/target-group-health-checks.html
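
For reference, a minimal sketch of overriding that default path on the target group; the resource name here is an illustrative placeholder, with /ping and port 3000 taken from the question:

resource "aws_lb_target_group" "backend" {
  name        = "backend-dev" # placeholder name
  port        = 3000
  protocol    = "HTTP"
  target_type = "ip"
  vpc_id      = data.aws_vpc.bid365-backend.id

  health_check {
    path    = "/ping" # ALB default is "/"
    matcher = "200"
  }
}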

Answered 9 months ago (Expert)
  • Yes, like this:

    resource "aws_lb_target_group" "map" {
      for_each    = var.target_groups
      name        = "backend-${each.key}"
      vpc_id      = data.aws_vpc.bid365-backend.id
      port        = 3000
      protocol    = "HTTP"
      target_type = "ip" # Specify the target type as "ip" for Fargate

      health_check {
        enabled             = true
        interval            = 60
        port                = "traffic-port"
        path                = "/ping"
        protocol            = "HTTP"
        timeout             = 5
        healthy_threshold   = 2
        unhealthy_threshold = 3
        matcher             = "200"
      }
    }


I have been in this situation and I feel your pain. Here are a few things that I did:

---Check 1

Check the security group inbound rules on the task/service, and the outbound rules on the ALB.
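
In practice, that means the task's security group needs an inbound rule on the container port from the ALB's security group, and the ALB's security group needs an outbound rule towards the targets. A minimal sketch, assuming the security group references used elsewhere in this thread (aws_security_group.fargate for the tasks, module.sg.id for the ALB):

resource "aws_security_group_rule" "alb_to_task" {
  type                     = "ingress"
  from_port                = 3000
  to_port                  = 3000
  protocol                 = "tcp"
  security_group_id        = aws_security_group.fargate.id # task security group
  source_security_group_id = module.sg.id                  # ALB security group
}

resource "aws_security_group_rule" "alb_egress_to_task" {
  type                     = "egress"
  from_port                = 3000
  to_port                  = 3000
  protocol                 = "tcp"
  security_group_id        = module.sg.id                  # ALB security group
  source_security_group_id = aws_security_group.fargate.id # destination: task security group
}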

---Check 2

I would try getting a shell on the Fargate task (via ECS Exec). There are some helpful instructions on how to do this here.

Then increase the health check interval and threshold counts on the target group.

Then, once you have a shell, figure out whether the health check is actually working: "http://localhost:3000/ping".
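
If it helps, ECS Exec is the usual way to get a shell on a Fargate task. A minimal sketch of what that needs, assuming the service resource is called aws_ecs_service.backend and reusing the task role name from the task definition (the CLI command at the end uses placeholder IDs):

# On the service, execute command must be enabled, e.g.
#   resource "aws_ecs_service" "backend" { ... enable_execute_command = true ... }

# The task role needs the SSM Messages permissions that ECS Exec relies on
resource "aws_iam_role_policy" "ecs_exec" {
  name = "ecs-exec"
  role = "task-role-backend-dev" # task role name from the task definition
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = [
        "ssmmessages:CreateControlChannel",
        "ssmmessages:CreateDataChannel",
        "ssmmessages:OpenControlChannel",
        "ssmmessages:OpenDataChannel"
      ]
      Resource = "*"
    }]
  })
}

# Then open a shell in the running task (cluster and task IDs are placeholders):
#   aws ecs execute-command --cluster <cluster> --task <task-id> \
#     --container api --interactive --command "/bin/sh"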

---Check 3

If you can't do Check 2, maybe try logging the ping output in the container for some more clues.

Answered 9 months ago
  • The security group for Fargate is created like this:

    ` resource "aws_security_group" "fargate" { name_prefix = "fargate-security-group-"

    vpc_id = "vpc-0370dd3da02a2770f" ingress { from_port = 80 to_port = 80 protocol = "tcp" cidr_blocks = ["0.0.0.0/0"] }

    ingress { from_port = 3000 to_port = 3000 protocol = "tcp" security_groups = ["sg-0d73dc6bd50a4d4a1"] # If you don't know the ELB's security group ID, use its CIDR range (e.g., 10.0.0.0/8): }

    egress { from_port = 0 to_port = 0 protocol = "-1" cidr_blocks = ["0.0.0.0/0"] } } `

    For the SSH part, I was even able to access public_ip_of_task:3000/ping and it returned success.


How's the security group on the ECS service configured?

Answered 9 months ago (Expert)
  • The security group is created like this:

    resource "aws_security_group" "fargate" {
      name_prefix = "fargate-security-group-"
      vpc_id      = "vpc-0370dd3da02a2770f"

      ingress {
        from_port   = 80
        to_port     = 80
        protocol    = "tcp"
        cidr_blocks = ["0.0.0.0/0"]
      }

      ingress {
        from_port       = 3000
        to_port         = 3000
        protocol        = "tcp"
        security_groups = ["sg-0d73dc6bd50a4d4a1"] # If you don't know the ELB's security group ID, use its CIDR range (e.g., 10.0.0.0/8)
      }

      egress {
        from_port   = 0
        to_port     = 0
        protocol    = "-1"
        cidr_blocks = ["0.0.0.0/0"]
      }
    }


Hi,

Can you try without this "healthCheck" part in your task definition:

"healthCheck": {
    "command": [
        "CMD-SHELL",
        "curl -f http://localhost:3000/ping || exit 1"
    ],
    ...
}

If your task then works, it means that the health check is the problem. You probably have to allow localhost connections in your security group. Add this:

ingress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["127.0.0.1/32"]
}

By the way, you don't need the CMD-SHELL health check, because the target group also checks this path; it's redundant.

Donov, answered 9 months ago
