Hi,
I have a CDK stack that creates an Application Load Balanced ECS service on a cluster in my VPC. When I create the cluster with hard-coded capacity instances, the service deploys, reaches steady state, and the application works as expected. When I remove the hard-coded instances and instead provide an EC2 capacity provider and Auto Scaling group, I cannot deploy the application. The ECS service is created and tasks are placed, and the tasks even show a "Healthy" state in the ECS dashboard. The application logs show successful health checks (i.e. 200 statuses are returned periodically), but after a few minutes ECS kills the healthy task. The service events list the reason: "service Foo instance Foo port 32768 is unhealthy in target-group Foo due to (reason Request timed out)". When I examine the killed task after shutdown, the console reports "Task failed ELB health checks in (target-group arn:aws:elasticloadbalancing:us-west-2:Foo:Foo/Foo)".
I suspect that switching to the instances provided by the capacity provider has complicated the networking, so that the ELB can no longer reach the task. The requests being logged are likely just the container health checks. What I can't figure out is how to repair the load balancer's path to the containers. When I create the Auto Scaling group, I pass in the same VPC that I provide to the cluster (and that cluster is passed to the ApplicationLoadBalancedEc2Service CDK construct).
Here is a working code example with hard-coded capacity instances:
const vpc = new Vpc(this, `vpc-${props.environmentName}`, {});
const clusterId = `cluster-${props.environmentName}`;
const cluster = new ECS.Cluster(this, `my-cluster`, {
  clusterName: clusterId,
  vpc,
  capacity: {
    instanceType: new EC2.InstanceType(props.clusterInstanceType),
  },
  containerInsights: true,
});
const serverTaskId = `server-task-${props.environmentName}`;
const serverTaskDefinition = new ECS.TaskDefinition(this, serverTaskId, {
  compatibility: ECS.Compatibility.EC2,
});
const serverContainer = serverTaskDefinition.addContainer('ServerContainer', {
  image: ContainerImage.fromEcrRepository(serverRepo, latestTag),
  containerName: 'my-server',
  memoryReservationMiB: 1024,
  portMappings: [
    {
      containerPort: Port.HTTP,
    }
  ],
  healthCheck: {
    command: [ `curl localhost/healthcheck` ],
    interval: cdk.Duration.seconds(120),
  },
});
const loadBalancedAPIService = new ECSPatterns.ApplicationLoadBalancedEc2Service(this, `server-service`, {
  cluster,
  taskDefinition: serverTaskDefinition,
  cpu: 256,
  memoryReservationMiB: 512,
  desiredCount: 2,
  protocol: LoadBalancingV2.ApplicationProtocol.HTTPS,
  openListener: true,
  domainZone: hostedZone,
  certificate: cert,
  publicLoadBalancer: true,
  maxHealthyPercent: 200,
  minHealthyPercent: 100,
});
loadBalancedAPIService.targetGroup.configureHealthCheck({
  path: "/healthcheck",
  port: 'traffic-port',
  protocol: LoadBalancingV2.Protocol.HTTP,
  interval: cdk.Duration.seconds(120),
  healthyThresholdCount: 2,
  unhealthyThresholdCount: 2,
});
Here is the broken code example with the EC2 capacity provider and autoscaling group:
const vpc = new Vpc(this, `vpc-${props.environmentName}`, {});
const clusterId = `cluster-${props.environmentName}`;
const cluster = new ECS.Cluster(this, `my-cluster`, {
  clusterName: clusterId,
  vpc,
  containerInsights: true,
});
const autoScalingGroup = new AutoScaling.AutoScalingGroup(this, `cluster-asg`, {
  vpc,
  instanceType: new EC2.InstanceType(props.clusterInstanceType),
  machineImage: ECS.EcsOptimizedImage.amazonLinux2(),
  maxCapacity: props.maxClusterInstanceCount,
  minCapacity: 1,
  healthCheck: AutoScaling.HealthCheck.ec2({
    grace: cdk.Duration.seconds(240),
  })
});
const capacityProvider = new ECS.AsgCapacityProvider(this, `cluster-capacity-provider`, {
  autoScalingGroup,
  maximumScalingStepSize: 2,
  minimumScalingStepSize: 1,
  canContainersAccessInstanceRole: true,
});
const capacityProviderStrategy: ECS.CapacityProviderStrategy = {
  capacityProvider: capacityProvider.capacityProviderName,
  weight: 1,
  base: 1,
};
cluster.addAsgCapacityProvider(capacityProvider, {
  canContainersAccessInstanceRole: true,
  machineImageType: ECS.MachineImageType.AMAZON_LINUX_2,
});
cluster.addDefaultCapacityProviderStrategy([
  capacityProviderStrategy,
]);
const serverTaskId = `server-task-${props.environmentName}`;
const serverTaskDefinition = new ECS.TaskDefinition(this, serverTaskId, {
  compatibility: ECS.Compatibility.EC2,
});
const loadBalancedAPIService = new ECSPatterns.ApplicationLoadBalancedEc2Service(this, `server-service`, {
  cluster,
  healthCheckGracePeriod: cdk.Duration.minutes(3),
  taskDefinition: serverTaskDefinition,
  cpu: 256,
  memoryReservationMiB: 512,
  desiredCount: 1,
  protocol: LoadBalancingV2.ApplicationProtocol.HTTPS,
  openListener: true,
  domainZone: hostedZone,
  certificate: cert,
  publicLoadBalancer: true,
  maxHealthyPercent: 200,
  minHealthyPercent: 100,
  capacityProviderStrategies: [
    {
      capacityProvider: capacityProvider.capacityProviderName,
      weight: 1,
      base: 1,
    }
  ],
});
loadBalancedAPIService.targetGroup.configureHealthCheck({
  path: "/healthcheck",
  port: 'traffic-port',
  protocol: LoadBalancingV2.Protocol.HTTP,
  interval: cdk.Duration.seconds(120),
  healthyThresholdCount: 2,
  unhealthyThresholdCount: 2,
});
I have tried adding explicit security groups to the load balancer and auto scaling group in the broken code above by adding:
const loadBalancerSG = new EC2.SecurityGroup(this, `loadbalancer-egress`, {
  vpc,
  allowAllIpv6Outbound: true,
});
loadBalancerSG.addIngressRule(EC2.Peer.ipv4('0.0.0.0/0'), EC2.Port.tcp(Port.HTTPS));
const allowInstanceTrafficFromAlbSecurityGroup = new EC2.SecurityGroup(this, `asg-ingress`, {
  vpc,
  allowAllOutbound: true,
});
allowInstanceTrafficFromAlbSecurityGroup.addIngressRule(loadBalancerSG, EC2.Port.tcp(Port.HTTP), "allow HTTP traffic from load balancer");
allowInstanceTrafficFromAlbSecurityGroup.addIngressRule(loadBalancerSG, EC2.Port.tcp(Port.HTTPS), "allow HTTPS traffic from load balancer");
autoScalingGroup.addSecurityGroup(allowInstanceTrafficFromAlbSecurityGroup);
I have tried the troubleshooting steps in this document (https://repost.aws/knowledge-center/troubleshoot-unhealthy-checks-ecs) which led me to add the security group definitions above, and now I'm seeing health check requests in the service logs BOTH from localhost and an IP in the 10.x.x.x range, which I assume to be the load balancer. Dozens of health check requests are succeeding, but still, the task is terminated with the same "Task failed ELB health checks" message.
Any advice would be appreciated. I am a bit out of my depth here and am not sure how the capacity provider is affecting the ALB networking. Thanks in advance!
Hi Shahad, thank you for your answer. I was able to use the EC2 network interfaces console to confirm that the health check requests are coming from network interfaces associated with the load balancer. The Reachability Analyzer shows that the instance created by the capacity provider is reachable from the ALB network interface.
The task events still show "instance is unhealthy due to (reason Request timed out)". I've tried extending the grace period for the ELB health check, and the task stays alive as long as the grace period is in effect, but then dies once the grace period ends.
Are the tasks taking longer to reply than the health check timeout configured on the ELB target group? Otherwise it is strange that the instance is reachable but the health checks still time out, unless the tasks are being registered on dynamic ports that aren't covered by the security groups. You may want to open a technical support case so someone can look at the setup of your specific resources.
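To make the dynamic-port idea concrete, here is a minimal sketch in plain TypeScript (not CDK code, and not necessarily the cause in this case). It models ingress rules as port ranges and checks whether the host port from the service event, 32768, would be covered. The `IngressRule` type and `isPortAllowed` helper are hypothetical, the rules assume that `Port.HTTP` and `Port.HTTPS` in the question are 80 and 443, and 32768-65535 is an assumed ephemeral range for dynamically mapped host ports.

```typescript
// Hypothetical model of a security group ingress rule:
// an inclusive TCP port range allowed from the load balancer.
interface IngressRule {
  fromPort: number;
  toPort: number;
}

// Returns true when at least one rule covers the given host port.
function isPortAllowed(rules: IngressRule[], port: number): boolean {
  return rules.some((rule) => port >= rule.fromPort && port <= rule.toPort);
}

// Rules equivalent to the ones added in the question,
// assuming Port.HTTP is 80 and Port.HTTPS is 443.
const rulesFromQuestion: IngressRule[] = [
  { fromPort: 80, toPort: 80 },
  { fromPort: 443, toPort: 443 },
];

// The same rules plus an assumed ephemeral range for dynamic host ports.
const rulesWithEphemeralRange: IngressRule[] = [
  ...rulesFromQuestion,
  { fromPort: 32768, toPort: 65535 },
];

// Host port reported in the ECS service event from the question.
const dynamicHostPort = 32768;

console.log(isPortAllowed(rulesFromQuestion, dynamicHostPort)); // false
console.log(isPortAllowed(rulesWithEphemeralRange, dynamicHostPort)); // true
```

If this turns out to be the gap, the CDK equivalent would probably be an extra ingress rule on the instance security group using `EC2.Port.tcpRange(32768, 65535)` with the load balancer's actual security group as the peer. Treat that as something to verify against your deployed security groups rather than a confirmed fix.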