Issue with AWS ECS Auto-Scaling and Binpack Task Placement Strategy: Tasks Not Shifting Back After Scale-In

I use AWS ECS with auto-scaling and a binpack task placement strategy. When the tasks scale up, new instances are attached to the ECS cluster. After a scale-in event, however, some tasks remain spread across different instances and do not shift back onto fewer instances as I expected. How can I resolve this issue?

Asked 6 months ago, 449 views
2 Answers
Accepted Answer

Hi Tharunkumar,

Please try the solution below. I hope it helps resolve your issue.

Implement an ECS Task Rebalancer:

  1. Create a Lambda Function: This function will check the task placement and stop tasks that need to be redistributed.

  2. Invoke Lambda Function Periodically: Use CloudWatch Events to trigger the Lambda function at regular intervals.

  3. CloudFormation Template: Use a CloudFormation template to create the Lambda function and set up the CloudWatch Event rule.

Lambda Function (Python)

This function lists tasks in your ECS cluster, groups them by instance, and stops tasks from under-utilized instances:


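import boto3

ecs_client = boto3.client('ecs')

def lambda_handler(event, context):
    cluster_name = 'your-cluster-name'
    service_name = 'your-service-name'

    # List all tasks of the service (list_tasks returns at most 100 ARNs per page)
    paginator = ecs_client.get_paginator('list_tasks')
    tasks = []
    for page in paginator.paginate(cluster=cluster_name, serviceName=service_name):
        tasks.extend(page['taskArns'])

    # describe_tasks rejects an empty task list, so exit early if nothing is running
    if not tasks:
        return {'statusCode': 200, 'body': 'No tasks found'}

    # Describe tasks in batches (describe_tasks accepts at most 100 tasks per call)
    tasks_details = []
    for i in range(0, len(tasks), 100):
        response = ecs_client.describe_tasks(cluster=cluster_name, tasks=tasks[i:i + 100])
        tasks_details.extend(response['tasks'])

    # Group tasks by container instance
    tasks_by_instance = {}
    for task in tasks_details:
        instance_arn = task.get('containerInstanceArn')
        if instance_arn is None:  # Tasks without a container instance (e.g. Fargate)
            continue
        tasks_by_instance.setdefault(instance_arn, []).append(task['taskArn'])

    # Example logic: stop tasks on under-utilized instances so that the service
    # scheduler relaunches them according to the binpack strategy
    stopped = 0
    if len(tasks_by_instance) > 1:  # Nothing to consolidate on a single instance
        for instance_arn, task_arns in tasks_by_instance.items():
            if len(task_arns) == 1:  # Adjust this threshold based on your binpack strategy
                ecs_client.stop_task(cluster=cluster_name, task=task_arns[0],
                                     reason='Rebalancing for binpack placement')
                stopped += 1

    return {
        'statusCode': 200,
        'body': f'Stopped {stopped} task(s) for rebalancing'
    }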
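
One caveat with the example logic above: if every instance happens to run exactly one task, all of them would be stopped in the same run. A more cautious selection step might look like the sketch below. It is only an illustration, not part of the original answer: the function name pick_tasks_to_stop and the parameters threshold and max_stops are hypothetical, and it assumes you pass in the same tasks_by_instance dictionary that the Lambda function builds.

```python
def pick_tasks_to_stop(tasks_by_instance, threshold=1, max_stops=1):
    """Choose task ARNs to stop so the scheduler can re-place them.

    tasks_by_instance maps a container instance ARN to its list of task ARNs.
    Instances running at most `threshold` tasks count as under-utilized.
    At most `max_stops` tasks are returned per run (hypothetical safety cap).
    """
    if len(tasks_by_instance) < 2:
        return []  # A single instance cannot be consolidated any further

    # Order instances from least to most loaded and never touch the busiest one,
    # so at least one instance always keeps its tasks.
    ordered = sorted(tasks_by_instance.items(), key=lambda item: len(item[1]))
    candidates = ordered[:-1]

    to_stop = []
    for _instance_arn, task_arns in candidates:
        if len(task_arns) <= threshold:
            to_stop.extend(task_arns)

    return to_stop[:max_stops]


if __name__ == "__main__":
    placement = {
        "instance-a": ["task-1"],
        "instance-b": ["task-2"],
        "instance-c": ["task-3", "task-4", "task-5"],
    }
    print(pick_tasks_to_stop(placement))
```

The Lambda function could then call stop_task only for the ARNs returned here. Stopping one task per run keeps the disruption small, at the cost of slower consolidation. Whether the replacement task actually lands on a fuller instance still depends on that instance having enough free CPU and memory.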

CloudFormation Template

This template sets up the Lambda function and the CloudWatch Event rule to trigger it periodically:


AWSTemplateFormatVersion: '2010-09-09'
Resources:
  MyLambdaExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: LambdaExecutionPolicy
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - logs:CreateLogGroup
                  - logs:CreateLogStream
                  - logs:PutLogEvents
                  - ecs:ListTasks
                  - ecs:DescribeTasks
                  - ecs:StopTask
                Resource: '*'

  MyLambdaFunction:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: MyRebalanceFunction
      Handler: index.lambda_handler
      Role: !GetAtt MyLambdaExecutionRole.Arn
      Code:
        ZipFile: |
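          import boto3

          ecs_client = boto3.client('ecs')

          def lambda_handler(event, context):
              cluster_name = 'your-cluster-name'
              service_name = 'your-service-name'

              # List all tasks of the service (list_tasks returns at most 100 ARNs per page)
              paginator = ecs_client.get_paginator('list_tasks')
              tasks = []
              for page in paginator.paginate(cluster=cluster_name, serviceName=service_name):
                  tasks.extend(page['taskArns'])

              # describe_tasks rejects an empty task list, so exit early if nothing is running
              if not tasks:
                  return {'statusCode': 200, 'body': 'No tasks found'}

              # Describe tasks in batches (describe_tasks accepts at most 100 tasks per call)
              tasks_details = []
              for i in range(0, len(tasks), 100):
                  response = ecs_client.describe_tasks(cluster=cluster_name, tasks=tasks[i:i + 100])
                  tasks_details.extend(response['tasks'])

              # Group tasks by container instance
              tasks_by_instance = {}
              for task in tasks_details:
                  instance_arn = task.get('containerInstanceArn')
                  if instance_arn is None:  # Tasks without a container instance (e.g. Fargate)
                      continue
                  tasks_by_instance.setdefault(instance_arn, []).append(task['taskArn'])

              # Example logic: stop tasks on under-utilized instances so that the service
              # scheduler relaunches them according to the binpack strategy
              stopped = 0
              if len(tasks_by_instance) > 1:  # Nothing to consolidate on a single instance
                  for instance_arn, task_arns in tasks_by_instance.items():
                      if len(task_arns) == 1:  # Adjust this threshold based on your binpack strategy
                          ecs_client.stop_task(cluster=cluster_name, task=task_arns[0],
                                               reason='Rebalancing for binpack placement')
                          stopped += 1

              return {
                  'statusCode': 200,
                  'body': f'Stopped {stopped} task(s) for rebalancing'
              }
      Runtime: python3.12
      Timeout: 60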

  CloudWatchEventRule:
    Type: AWS::Events::Rule
    Properties:
      ScheduleExpression: rate(5 minutes)
      Targets:
        - Arn: !GetAtt MyLambdaFunction.Arn
          Id: "TargetFunctionV1"

  LambdaInvokePermission:
    Type: AWS::Lambda::Permission
    Properties:
      Action: lambda:InvokeFunction
      FunctionName: !Ref MyLambdaFunction
      Principal: events.amazonaws.com
      SourceArn: !GetAtt CloudWatchEventRule.Arn

The following AWS documentation pages cover the services involved:

1. AWS Lambda

https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-lambda-function.html

https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-lambda-permission.html

2. AWS IAM

https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-iam-role.html

3. Amazon CloudWatch Events

https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-events-rule.html

4. Amazon ECS

https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_ListTasks.html

https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_DescribeTasks.html

https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_StopTask.html

EXPERT
answered 6 months ago
EXPERT
reviewed 4 months ago
EXPERT
reviewed 6 months ago

Are you using a Capacity Provider for the ASG? If so, do you have the Managed Termination Protection feature enabled? It prevents instances from being scaled in as long as any replica tasks are running on them.

If you want tasks to be killed and replaced on new instances so that they binpack better, disable Managed Termination Protection, enable Managed Draining instead, and set the target value to 100.
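
As a rough illustration of the settings described above, the change could be applied with boto3 along these lines. This is a sketch under assumptions, not a verified fix for your setup: the helper names and the capacity provider name are hypothetical, it assumes the capacity provider already exists and is backed by your ASG, and the managedDraining field needs a reasonably recent boto3 version.

```python
def build_capacity_provider_update(name, target_capacity=100):
    """Build the arguments for ecs.update_capacity_provider (hypothetical helper)."""
    return {
        "name": name,
        "autoScalingGroupProvider": {
            # Managed scaling with the target value set to 100
            "managedScaling": {
                "status": "ENABLED",
                "targetCapacity": target_capacity,
            },
            # Let the ASG terminate instances even if tasks are running on them
            "managedTerminationProtection": "DISABLED",
            # Drain tasks off an instance before it is terminated
            "managedDraining": "ENABLED",
        },
    }


def apply_capacity_provider_update(name, target_capacity=100):
    """Send the update to ECS; requires AWS credentials and boto3."""
    import boto3  # Imported lazily so the builder works without AWS access

    ecs_client = boto3.client("ecs")
    params = build_capacity_provider_update(name, target_capacity)
    return ecs_client.update_capacity_provider(**params)


if __name__ == "__main__":
    # 'my-capacity-provider' is a placeholder name
    print(build_capacity_provider_update("my-capacity-provider"))
```

If instance scale-in protection was previously turned on for the Auto Scaling group or its existing instances, it may also need to be removed there, since that is configured on the ASG side.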

AWS
EXPERT
answered 6 months ago
  • I did this, but it is not working.
