Capacity Provider never scales container instances on AWS Batch Unmanaged ECS
I am trying to implement ECS Autoscaling with Capacity Provider in an AWS Batch Unmanaged Compute Environment.
The following CloudFormation template was used to create the environment. The initial Desired Capacity of AutoScalingGroup is 0.
I submitted a job to AWS Batch, but the Capacity Provider does not scale Container Instances, so the job is stuck in the Runnable state. In this state, if you manually increase the Desired Capacity of the AutoScalingGroup, the Container Instances will scale and the job will run.
Also, when the Desired Capacity of the AutoScalingGroup is 0, if you execute an ECS task manually, the Capacity Provider will change the Desired Capacity of the AutoScalingGroup and the Container Instances will be scaled.
What changes should be made so that the Capacity Provider can successfully scale Container Instances and execute jobs by submitting a Job in AWS Batch?
[CloudFormation Template]:
AWSTemplateFormatVersion: '2010-09-09'
Description: >
  AWS Batch Unmanaged ECS Capacity Provider Test
Parameters:
  ServiceName:
    Type: String
    Default: "test-batch-unmanaged"
  AvailabilityZone:
    Type: String
    Default: "ap-northeast-1a"
  BatchInstanceAMI:
    Type: AWS::EC2::Image::Id
    Description: Batch ECS Instance AMI
    Default: ami-0049422eda1bb52a7 # ECS Optimized AMI
Resources:
  BatchVPC:
    Type: AWS::EC2::VPC
    Properties:
      CidrBlock: 10.123.0.0/24
      EnableDnsSupport: true
      EnableDnsHostnames: true
      InstanceTenancy: default
      Tags:
        - Key: Name
          Value: !Sub "${ServiceName}-vpc"
  BatchInstanceRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - ec2.amazonaws.com
                - spotfleet.amazonaws.com
            Action:
              - sts:AssumeRole
      Path: "/"
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role
        - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
  BatchInstanceProfile:
    Type: AWS::IAM::InstanceProfile
    Properties:
      Path: "/"
      Roles:
        - !Ref BatchInstanceRole
  BatchInstanceSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      VpcId: !Ref BatchVPC
      GroupDescription: "Youtube Transcriber Batch Security Group"
      SecurityGroupIngress:
        - IpProtocol: "tcp"
          FromPort: "22"
          ToPort: "22"
          CidrIp: 0.0.0.0/0
  JobRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - ecs-tasks.amazonaws.com
            Action:
              - sts:AssumeRole
      Path: "/"
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy
  PublicRouteTable:
    Type: AWS::EC2::RouteTable
    Properties:
      VpcId: !Ref BatchVPC
      Tags:
        - Key: Name
          Value: !Sub "${ServiceName}-public-route"
  PublicSubnet:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref BatchVPC
      CidrBlock: 10.123.0.0/26
      AvailabilityZone: !Ref AvailabilityZone
      MapPublicIpOnLaunch: true
      Tags:
        - Key: Name
          Value: !Sub "${ServiceName}-public-subnet"
  PublicSubnetRouteTableAssociation:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      SubnetId: !Ref PublicSubnet
      RouteTableId: !Ref PublicRouteTable
  InternetGateway:
    Type: AWS::EC2::InternetGateway
    Properties:
      Tags:
        - Key: Name
          Value: !Sub "${ServiceName}-igw"
  AttachGateway:
    Type: AWS::EC2::VPCGatewayAttachment
    Properties:
      VpcId: !Ref BatchVPC
      InternetGatewayId: !Ref InternetGateway
  PublicRoutes:
    Type: AWS::EC2::Route
    DependsOn: AttachGateway
    Properties:
      RouteTableId: !Ref PublicRouteTable
      DestinationCidrBlock: 0.0.0.0/0
      GatewayId: !Ref InternetGateway
  FleetRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - spotfleet.amazonaws.com
            Action:
              - sts:AssumeRole
      Path: "/"
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AmazonEC2SpotFleetTaggingRole
  BatchServiceRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - batch.amazonaws.com
            Action:
              - sts:AssumeRole
      Path: "/"
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSBatchServiceRole
  ComputeEnvironment:
    Type: AWS::Batch::ComputeEnvironment
    Properties:
      Type: UNMANAGED
      ServiceRole: !GetAtt BatchServiceRole.Arn
      ComputeEnvironmentName: !Sub "${ServiceName}-ce-${BatchInstanceAMI}"
      State: ENABLED
  EcsClusterArnOfCELambdaRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - lambda.amazonaws.com
            Action:
              - sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
        - arn:aws:iam::aws:policy/AWSBatchFullAccess
  EcsClusterArnOfCELambda:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: CustomResourceEcsClusterArnOfCE
      Handler: index.lambda_handler
      Runtime: python3.9
      Role: !GetAtt EcsClusterArnOfCELambdaRole.Arn
      MemorySize: 128
      Timeout: 300
      Code:
        ZipFile: |
          import boto3
          import logging
          import time  # needed for the polling delay below

          logger = logging.getLogger("EcsClusterArnOfCE")
          logger.setLevel(logging.INFO)
          batchClient = boto3.client('batch')

          def lambda_handler(event, context):
            logger.info(event)
            import cfnresponse
            try:
              if event['RequestType'] == 'Delete':
                cfnresponse.send(event, context, cfnresponse.SUCCESS, {'Response': 'Success', 'EcsClusterArn': '' })
                return
              # Following Create or Update
              isWaitForValid = event['ResourceProperties']['WaitForValid']
              isWaitForValid = bool(isWaitForValid) if isWaitForValid else True
              ceName = event['ResourceProperties']['CEName']
              # Poll until the compute environment reports VALID
              while True:
                response = batchClient.describe_compute_environments(
                  computeEnvironments = [ ceName ]
                )
                logger.info(response)
                ce = response['computeEnvironments'][0]
                if not isWaitForValid or ce['status'] == 'VALID':
                  break
                logger.info('wait for status to valid')
                logger.info(ce)
                time.sleep(5)
              ecsClusterArn = ce['ecsClusterArn']
              if ecsClusterArn:
                cfnresponse.send(event, context, cfnresponse.SUCCESS, {'Response': 'Success', 'EcsClusterArn': ecsClusterArn})
              else:
                logger.error("EcsClusterArn is null")
                cfnresponse.send(event, context, cfnresponse.FAILED, {'Response': 'Failure', 'EcsClusterArn': ''})
            except Exception as e:
              logger.error(e)
              cfnresponse.send(event, context, cfnresponse.FAILED, {'Response': 'Failure', 'EcsClusterArn': ''})
  EcsClusterArnOfCE:
    Type: Custom::EcsClusterArnOfCE
    Properties:
      ServiceToken: !GetAtt EcsClusterArnOfCELambda.Arn
      CEName: !Ref ComputeEnvironment
      WaitForValid: True
  BatchComputeLaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateName: !Sub "${ServiceName}-batch-launch-template"
      LaunchTemplateData:
        ImageId: !Ref BatchInstanceAMI
        IamInstanceProfile:
          Arn: !GetAtt BatchInstanceProfile.Arn
        InstanceType: t3.micro
        InstanceMarketOptions:
          MarketType: spot
          SpotOptions:
            SpotInstanceType: one-time
        EbsOptimized: True
        UserData:
          Fn::Base64: !Sub |
            #!/bin/bash
            cat <<'EOF' >> /etc/ecs/ecs.config
            ECS_CLUSTER=${EcsClusterArnOfCE.EcsClusterArn}
            EOF
  ASGCompute:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      CapacityRebalance: True
      MinSize: 0
      MaxSize: 5
      NewInstancesProtectedFromScaleIn: False
      LaunchTemplate:
        LaunchTemplateId: !Ref BatchComputeLaunchTemplate
        Version: !GetAtt BatchComputeLaunchTemplate.LatestVersionNumber
      VPCZoneIdentifier:
        - !Ref PublicSubnet
      Tags:
        - Key: Name
          Value: !Sub "${ServiceName}-batch-asg"
          PropagateAtLaunch: True
    UpdatePolicy:
      AutoScalingReplacingUpdate:
        WillReplace: True
  BatchCapacityProvider:
    Type: AWS::ECS::CapacityProvider
    Properties:
      AutoScalingGroupProvider:
        AutoScalingGroupArn: !Ref ASGCompute
        ManagedScaling:
          Status: ENABLED
          TargetCapacity: 100
          MaximumScalingStepSize: 10
          MinimumScalingStepSize: 1
          InstanceWarmupPeriod: 60
        ManagedTerminationProtection: DISABLED
        ManagedDraining: ENABLED
  BatchCapacityProviderAssociations:
    Type: AWS::ECS::ClusterCapacityProviderAssociations
    Properties:
      CapacityProviders:
        - !Ref BatchCapacityProvider
      Cluster: !GetAtt EcsClusterArnOfCE.EcsClusterArn
      DefaultCapacityProviderStrategy:
        - CapacityProvider: !Ref BatchCapacityProvider
          Weight: 1
          Base: 0
  BatchJobQueue:
    Type: AWS::Batch::JobQueue
    Properties:
      JobQueueName: !Sub "${ServiceName}-job-queue"
      ComputeEnvironmentOrder:
        - ComputeEnvironment: !Ref ComputeEnvironment
          Order: 1
      Priority: 1
      State: ENABLED
  BatchJobDefinition:
    Type: AWS::Batch::JobDefinition
    Properties:
      Type: container
      JobDefinitionName: !Sub "${ServiceName}-batch"
      Parameters:
        Param: 'test'
      ContainerProperties:
        Command:
          - echo
          - 'Ref::Param'
        ResourceRequirements:
          - Type: MEMORY
            Value: 256
          - Type: VCPU
            Value: 1
        JobRoleArn: !Ref JobRole
        Image: !Sub "busybox:latest"
      Timeout:
        AttemptDurationSeconds: 3600
      RetryStrategy:
        Attempts: 1
Outputs:
  BatchJobQueue:
    Value: !Ref BatchJobQueue
  BatchJobDefinition:
    Value: !Ref BatchJobDefinition
[Reproduction code (CLI)]:
STACK_NAME=batch-unmanaged-test

# create stack
STACK_ARN=$(aws cloudformation create-stack --stack-name $STACK_NAME --template-body file://`pwd`/batch-stack-template.yaml --capabilities CAPABILITY_NAMED_IAM | jq -r .StackId)

# wait for completion
aws cloudformation wait stack-create-complete --stack-name $STACK_ARN

# read parameters from stack outputs
BATCH_JOB_QUEUE=$(aws cloudformation describe-stacks --stack-name $STACK_ARN | jq -r '.Stacks[0].Outputs[] | select(.OutputKey == "BatchJobQueue") | .OutputValue')
BATCH_JOB_DEFINITION=$(aws cloudformation describe-stacks --stack-name $STACK_ARN | jq -r '.Stacks[0].Outputs[] | select(.OutputKey == "BatchJobDefinition") | .OutputValue')

# submit batch job (the submission succeeds, but the job never runs because the capacity provider does not scale container instances)
aws batch submit-job --job-name batch-submit-test --job-queue $BATCH_JOB_QUEUE --job-definition $BATCH_JOB_DEFINITION

# delete stack
# aws cloudformation delete-stack --stack-name $STACK_NAME
Hello,
I doubt that ECS capacity providers can be used with an unmanaged compute environment in AWS Batch. Instead, you can use a user data script in the launch template to register the EC2 Auto Scaling instances with the ECS cluster, and set the minimum number of instances to launch in the EC2 Auto Scaling group.
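If a fixed minimum is too static, one possible workaround is to adjust the Auto Scaling group's Desired Capacity yourself, based on the number of active jobs in the queue. The following is a minimal sketch under stated assumptions, not an AWS-documented pattern for this setup. The function names, the `jobs_per_instance` parameter, and the idea of running this on a schedule are all hypothetical. `batch_client` and `autoscaling_client` are assumed to be `boto3.client("batch")` and `boto3.client("autoscaling")`. It only scales up and leaves scale-in to you.

```python
import math

# Job states that still need (or are using) a container instance.
ACTIVE_STATUSES = ("RUNNABLE", "STARTING", "RUNNING")


def target_capacity(active_jobs, jobs_per_instance, max_size):
    """Instances needed for the active jobs, capped at the group's MaxSize."""
    if jobs_per_instance < 1:
        raise ValueError("jobs_per_instance must be at least 1")
    return min(max_size, math.ceil(active_jobs / jobs_per_instance))


def count_jobs(batch_client, job_queue, status):
    """Count jobs in one status, following list_jobs pagination."""
    count = 0
    kwargs = {"jobQueue": job_queue, "jobStatus": status}
    while True:
        page = batch_client.list_jobs(**kwargs)
        count += len(page.get("jobSummaryList", []))
        token = page.get("nextToken")
        if not token:
            return count
        kwargs["nextToken"] = token


def scale_up_for_active_jobs(batch_client, autoscaling_client, job_queue,
                             asg_name, current_desired, max_size,
                             jobs_per_instance=1):
    """Raise Desired Capacity when active jobs need more instances.

    Returns the Desired Capacity in effect after the call.
    """
    active = sum(count_jobs(batch_client, job_queue, s) for s in ACTIVE_STATUSES)
    target = target_capacity(active, jobs_per_instance, max_size)
    if target > current_desired:
        autoscaling_client.set_desired_capacity(
            AutoScalingGroupName=asg_name,
            DesiredCapacity=target,
            HonorCooldown=False,
        )
        return target
    return current_desired
```

With the template above, `max_size` would be 5 and `job_queue` the queue name from the stack outputs. `jobs_per_instance` is a rough assumption about how many jobs fit on one instance, so adjust it to your job definition's vCPU and memory requirements.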
After you have created your unmanaged compute environment, use the DescribeComputeEnvironments API operation to view the compute environment details. Find the Amazon ECS cluster that's associated with the environment and then manually launch your container instances into that Amazon ECS cluster.
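As a rough illustration of the lookup step, the sketch below reads the `ecsClusterArn` field from the DescribeComputeEnvironments response. The function name `find_ecs_cluster_arn` is hypothetical, and `batch_client` is assumed to be a `boto3.client("batch")` created with working credentials and region.

```python
def find_ecs_cluster_arn(batch_client, compute_environment_name):
    """Return the ECS cluster ARN behind a Batch compute environment, or None."""
    response = batch_client.describe_compute_environments(
        computeEnvironments=[compute_environment_name]
    )
    environments = response.get("computeEnvironments", [])
    if not environments:
        # No compute environment with that name (or ARN) was found.
        return None
    # May be missing until the environment has finished being created.
    return environments[0].get("ecsClusterArn")
```

The returned ARN is the value to write as `ECS_CLUSTER` in `/etc/ecs/ecs.config` on each container instance, so that the ECS agent registers with the cluster that AWS Batch created.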
Experiencing the same thing. AWS Batch is quite confusing when you try to understand how it interacts with Auto Scaling.