SQS depth based ECS task auto-scaling using step scaling.

#devops #aws #ecs #autoscaling

As mentioned in a previous blog post, we can scale ECS tasks on SQS queue depth using step scaling.

Step scaling policies increase or decrease the current capacity of a scalable target based on a set of scaling adjustments, known as step adjustments. The adjustments vary based on the size of the alarm breach. All alarms that are breached are evaluated by Application Auto Scaling as it receives the alarm messages.

With step scaling, you choose the scaling metrics and the threshold values for the CloudWatch alarms that trigger the scaling process, so you have to create those CloudWatch alarms yourself.

When you create a step scaling policy, you add one or more step adjustments that enable you to scale based on the size of the alarm breach. Each step adjustment specifies the following:

A lower bound for the metric value
An upper bound for the metric value
The amount by which to scale, based on the scaling adjustment type
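
To make the bounds concrete, here is a minimal bash sketch of how a step adjustment could be picked for a breach above the threshold. The STEPS table and the select_adjustment function are hypothetical names made up for illustration, the threshold of 10 mirrors the queue-depth threshold used later in this post, and the sketch assumes integer metric values. Bounds are offsets from the alarm threshold; for a breach above the threshold, the lower bound is inclusive and the upper bound is exclusive.

```shell
#!/usr/bin/env bash
# Hypothetical step adjustments, one per entry: "lower upper adjustment".
# Bounds are offsets from the alarm threshold; "-" marks an open upper bound.
STEPS=("0 15 1" "15 25 2" "25 - 3")

# Print the scaling adjustment for a metric value at or above the threshold.
# Returns 1 when the metric is below the threshold, i.e. no step applies.
select_adjustment() {
  local threshold="$1" metric="$2" breach step lower upper adjustment
  breach=$(( metric - threshold ))
  for step in "${STEPS[@]}"; do
    read -r lower upper adjustment <<< "$step"
    # Lower bound inclusive, upper bound exclusive (or unbounded).
    if (( breach >= lower )) && { [[ "$upper" == "-" ]] || (( breach < upper )); }; then
      echo "$adjustment"
      return 0
    fi
  done
  return 1
}

select_adjustment 10 12   # breach of 2  -> prints 1
select_adjustment 10 30   # breach of 20 -> prints 2
select_adjustment 10 60   # breach of 50 -> prints 3
```

A metric value of 25 with a threshold of 10 gives a breach of exactly 15, which lands in the second step because the first step's upper bound is exclusive.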

Application Auto Scaling supports the following adjustment types for step scaling policies:

ChangeInCapacity—Increase or decrease the current capacity of the scalable target by the specified value. A positive value increases the capacity and a negative value decreases the capacity. For example: If the current capacity is 3 and the adjustment is 5, then Application Auto Scaling adds 5 to the capacity for a total of 8.

ExactCapacity—Change the current capacity of the scalable target to the specified value. Specify a positive value with this adjustment type. For example: If the current capacity is 3 and the adjustment is 5, then Application Auto Scaling changes the capacity to 5.

PercentChangeInCapacity—Increase or decrease the current capacity of the scalable target by the specified percentage. A positive value increases the capacity and a negative value decreases the capacity. For example: If the current capacity is 10 and the adjustment is 10 percent, then Application Auto Scaling adds 1 to the capacity for a total of 11.
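
The three adjustment types can be sketched as a small bash function that reproduces the examples above. The function name new_capacity is hypothetical, and the rounding of a fractional percentage change (a nonzero change smaller than one task is treated as one task, larger values are truncated) reflects my reading of the Application Auto Scaling documentation rather than anything verified here. Clamping to the scalable target's minimum and maximum capacity is left out.

```shell
#!/usr/bin/env bash
# Hypothetical helper: capacity after applying one scaling adjustment.
# Usage: new_capacity <adjustment-type> <current-capacity> <adjustment>
new_capacity() {
  local type="$1" current="$2" adjustment="$3" change
  case "$type" in
    ChangeInCapacity)
      echo $(( current + adjustment )) ;;
    ExactCapacity)
      echo "$adjustment" ;;
    PercentChangeInCapacity)
      # Integer division truncates toward zero.
      change=$(( current * adjustment / 100 ))
      # Assumption: a nonzero change smaller than one task counts as one task.
      if (( change == 0 && current * adjustment != 0 )); then
        if (( adjustment > 0 )); then change=1; else change=-1; fi
      fi
      echo $(( current + change )) ;;
    *)
      echo "unknown adjustment type: $type" >&2
      return 1 ;;
  esac
}

new_capacity ChangeInCapacity 3 5          # prints 8
new_capacity ExactCapacity 3 5             # prints 5
new_capacity PercentChangeInCapacity 10 10 # prints 11
```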

Below is the CloudFormation template with the implementation of step scaling with SQS.

AWSTemplateFormatVersion: 2010-09-09
Description: Creates a Fargate-based auto-scaling environment that processes work from an SQS queue
Parameters:
  DockerImageUrl:
    Type: String
    Default: latest

  DockerContainerName:
    Type: String
    Default: consumer-service

  EnvironmentName:
    Type: String
    Default: dev

  Memory:
    Type: String
    Default: 8GB

  Cpu:
    Type: Number
    Default: 2048 # 2 vCPU

  ContainerPort:
    Type: Number
    Default: 3000

  HealthCheckPath:
    Type: String
    Default: http://localhost:3000/check

  FaragateScalingEnvSSM:
    Type: AWS::SSM::Parameter::Value<String>
    Default: "/config/ecs/FARGATE_SCALING_ENV"

  QueueDepthScaleOutAlarmThresholdSSM:
    Type: AWS::SSM::Parameter::Value<String>
    Default: "/config/ecs/consumer-service/QUEUE_DEPTH_SCALE_OUT_ALARM_THRESHOLD"

  CpuUtilizationScaleInAlarmThresholdSSM:
    Type: AWS::SSM::Parameter::Value<String>
    Default: "/config/ecs/consumer-service/CPU_UTILIZATION_SCALE_IN_ALARM_THRESHOLD"

  CpuUtilizationNoComputeOrScaleInAlarmEvaluationPeriodsSSM:
    Type: AWS::SSM::Parameter::Value<String>
    Default: "/config/ecs/consumer-service/CPU_UTILIZATION_NO_COMPUTE_OR_SCALE_IN_ALARM_EVALUATION_PERIODS"

  ComputeAutoScalingTargetMaxCapacitySSM:
    Type: AWS::SSM::Parameter::Value<String>
    Default: "/config/ecs/consumer-service/AUTO_SCALING_TARGET_MAX_CAPACITY"

Conditions:
  CreateNonProdResources: !Equals [!Ref FaragateScalingEnvSSM, 'non-prod']
  CreateProdResources: !Equals [!Ref FaragateScalingEnvSSM, 'prod']

Resources:
  SQSQueue:
    Type: 'AWS::SQS::Queue'
    # Properties:
    #   ReceiveMessageWaitTimeSeconds: 20
    #   VisibilityTimeout: 1200 # 20 minutes
    #   MessageRetentionPeriod: 1209600 # 14 Days

  QueueUrlParameter:
    Type: 'AWS::SSM::Parameter'
    Properties:
      Name: !Join
        - ''
        - - /
          - !Ref EnvironmentName
          - /services/
          - !Ref DockerContainerName
          - /SQS_QUEUE_URL
      Type: String
      Value: !Ref SQSQueue

  ComputeTaskLogGroup:
    Type: 'AWS::Logs::LogGroup'
    Properties:
      LogGroupName: !Join
        - /
        - - /x-org
          - ecs
          - !Sub '${AWS::StackName}'
          - logs

  ComputeTaskRole:
    Type: 'AWS::IAM::Role'
    Properties:
      AssumeRolePolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - ecs-tasks.amazonaws.com
            Action:
              - 'sts:AssumeRole'
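      # NOTE: wildcard actions on every resource are far broader than a queue consumer needs; scope these down before real use.
      Policies: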
        - PolicyName: Required_Access
          PolicyDocument:
            Version: 2012-10-17
            Statement:
              - Effect: Allow
                Action:
                  - 'sqs:*'
                  - 'secretsmanager:*'
                  - 'ssm:*'
                  - 'logs:*'
                  - 'dynamodb:*'
                  - 's3:*'
                  - 'ecs:*'
                Resource: '*'
      ManagedPolicyArns:
        - 'arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy'

  ComputeTaskDefinition:
    Type: 'AWS::ECS::TaskDefinition'
    DependsOn: ComputeTaskLogGroup
    Properties:
      TaskRoleArn: !GetAtt ComputeTaskRole.Arn
      ExecutionRoleArn: !GetAtt ComputeTaskRole.Arn
      RequiresCompatibilities:
        - FARGATE
      NetworkMode: awsvpc
      Cpu: !Ref Cpu
      Memory: !Ref Memory
      ContainerDefinitions:
        - Name: !Sub '${AWS::StackName}'
          Image: !Ref DockerImageUrl
          LogConfiguration:
            LogDriver: awslogs
            Options:
              awslogs-region: !Ref AWS::Region # the log group is created in the stack's region, so avoid hardcoding it
              awslogs-group: !Ref ComputeTaskLogGroup
              awslogs-stream-prefix: ecs
          HealthCheck:
            Command:
              - CMD-SHELL
              - !Sub 'curl -f ${HealthCheckPath} || exit 1'
            Interval: 30
            Retries: 3
            StartPeriod: 300
          PortMappings:
            - ContainerPort: !Ref ContainerPort
              Protocol: tcp
          Environment:
            - Name: EnvironmentName
              Value: !Ref EnvironmentName
            - Name: SQS_QUEUE_URL
              Value: !Ref SQSQueue

  ComputeCluster:
    Type: 'AWS::ECS::Cluster'
    # Properties:
    #   ClusterName: !Join ['-', [!Ref DockerContainerName, cluster]]

  NonProdComputeService:
    Type: 'AWS::ECS::Service'
    Condition: CreateNonProdResources # only create if it is NonProd env
    Properties:
      Cluster: !Ref ComputeCluster
      TaskDefinition: !Ref ComputeTaskDefinition
      DeploymentConfiguration:
        MinimumHealthyPercent: 100
        MaximumPercent: 200
      # Desired count should be 0; otherwise the ECS service scheduler restarts the desired number of tasks whenever they stop
      DesiredCount: 0
      # This may need to be adjusted if the container takes a while to start up
      # HealthCheckGracePeriodSeconds: 30
      LaunchType: FARGATE
      NetworkConfiguration:
        AwsvpcConfiguration:
          # Change it to DISABLED if you're using private subnets that have access to a NAT gateway
          AssignPublicIp: ENABLED
          Subnets:
            - !ImportValue ComputeSubnetA
            - !ImportValue ComputeSubnetB
            - !ImportValue ComputeSubnetC
          SecurityGroups:
            - !ImportValue ComputeSecurityGroup

  ProdComputeService:
    Type: 'AWS::ECS::Service'
    Condition: CreateProdResources # only create if it is Prod env
    Properties:
      Cluster: !Ref ComputeCluster
      TaskDefinition: !Ref ComputeTaskDefinition
      DeploymentConfiguration:
        MinimumHealthyPercent: 100
        MaximumPercent: 200
      DesiredCount: 1
      # This may need to be adjusted if the container takes a while to start up
      # HealthCheckGracePeriodSeconds: 30
      LaunchType: FARGATE
      NetworkConfiguration:
        AwsvpcConfiguration:
          # Change it to DISABLED if you're using private subnets that have access to a NAT gateway
          AssignPublicIp: ENABLED
          Subnets:
            - !ImportValue ComputeSubnetA
            - !ImportValue ComputeSubnetB
            - !ImportValue ComputeSubnetC
          SecurityGroups:
            - !ImportValue ComputeSecurityGroup

  ComputeAutoScalingRole:
    Type: 'AWS::IAM::Role'
    Properties:
      AssumeRolePolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - ecs-tasks.amazonaws.com
                - application-autoscaling.amazonaws.com
            Action:
              - 'sts:AssumeRole'
      Path: "/"
      Policies:
      - PolicyName: !Sub ${DockerContainerName}-ECSAutoScalingRole
        PolicyDocument:
          Statement:
          - Effect: Allow
            Action:
            - ecs:UpdateService
            - ecs:DescribeServices
            - application-autoscaling:*
            - cloudwatch:DescribeAlarms
            - cloudwatch:GetMetricStatistics
            Resource: "*"
      ManagedPolicyArns:
        - 'arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceAutoscaleRole'

  NonProdComputeAutoScalingTarget:
    Type: 'AWS::ApplicationAutoScaling::ScalableTarget'
    Condition: CreateNonProdResources # only create if it is NonProd env
    Properties:
      MinCapacity: 0 # As desired task can be 0
      MaxCapacity: !Ref ComputeAutoScalingTargetMaxCapacitySSM
      ResourceId: !Join
          - '/'
          - - service
            - !Ref ComputeCluster
            - !GetAtt NonProdComputeService.Name
      ScalableDimension: ecs:service:DesiredCount
      ServiceNamespace: ecs
      RoleARN: !GetAtt ComputeAutoScalingRole.Arn

  ProdComputeAutoScalingTarget:
    Type: 'AWS::ApplicationAutoScaling::ScalableTarget'
    Condition: CreateProdResources # only create if it is Prod env
    Properties:
      MinCapacity: 1 # As desired task can be 1 but not 0
      MaxCapacity: !Ref ComputeAutoScalingTargetMaxCapacitySSM 
      ResourceId: !Join
          - '/'
          - - service
            - !Ref ComputeCluster
            - !GetAtt ProdComputeService.Name
      ScalableDimension: ecs:service:DesiredCount
      ServiceNamespace: ecs
      RoleARN: !GetAtt ComputeAutoScalingRole.Arn

  # https://docs.aws.amazon.com/autoscaling/application/userguide/application-auto-scaling-step-scaling-policies.html
  NonProdNoComputeAutoScalingPolicy:
    Type: 'AWS::ApplicationAutoScaling::ScalingPolicy'
    Condition: CreateNonProdResources # only create if it is NonProd env
    Properties:
      PolicyName: !Sub ${DockerContainerName}-NonProdNoComputeAutoScalingPolicy
      PolicyType: StepScaling
      ScalingTargetId: !Ref NonProdComputeAutoScalingTarget
      ScalableDimension: ecs:service:DesiredCount
      ServiceNamespace: ecs
      StepScalingPolicyConfiguration:
        AdjustmentType: ExactCapacity # Can use PercentChangeInCapacity but then need to come up with configuration including some estimated change in percent
        Cooldown: 60
        MetricAggregationType: Average # Valid values are Minimum, Maximum, and Average. If the aggregation type is null, the value is treated as Average. 
        StepAdjustments: 
        - MetricIntervalLowerBound: !Ref AWS::NoValue
          MetricIntervalUpperBound: 0
          ScalingAdjustment: 0

  # https://docs.aws.amazon.com/autoscaling/application/userguide/application-auto-scaling-step-scaling-policies.html
  NonProdInitialComputeAutoScalingPolicy:
    Type: 'AWS::ApplicationAutoScaling::ScalingPolicy'
    Condition: CreateNonProdResources # only create if it is NonProd env
    Properties:
      PolicyName: !Sub ${DockerContainerName}-NonProdInitialComputeAutoScalingPolicy
      PolicyType: StepScaling
      ScalingTargetId: !Ref NonProdComputeAutoScalingTarget
      ScalableDimension: ecs:service:DesiredCount
      ServiceNamespace: ecs
      StepScalingPolicyConfiguration:
        AdjustmentType: ChangeInCapacity # ChangeInCapacity —> Increase or decrease the current capacity of the scalable target by the specified value
        Cooldown: 60 # 1 min delay 
        MetricAggregationType: Minimum # Valid values are Minimum, Maximum, and Average. If the aggregation type is null, the value is treated as Average.
        StepAdjustments: 
        - MetricIntervalLowerBound: 0
          MetricIntervalUpperBound: !Ref AWS::NoValue
          ScalingAdjustment: 1 # scaling up by 1 container when the alarm is greater than or equal to the Metric Threshold

  # https://docs.aws.amazon.com/autoscaling/application/userguide/application-auto-scaling-step-scaling-policies.html
  NonProdComputeAutoScalingScaleOutPolicy:
    Type: 'AWS::ApplicationAutoScaling::ScalingPolicy'
    Condition: CreateNonProdResources # only create if it is NonProd env
    Properties:
      PolicyName: !Sub ${DockerContainerName}-NonProdComputeAutoScalingScaleOutPolicy
      PolicyType: StepScaling
      ScalingTargetId: !Ref NonProdComputeAutoScalingTarget
      ScalableDimension: ecs:service:DesiredCount
      ServiceNamespace: ecs
      StepScalingPolicyConfiguration:
        AdjustmentType: ChangeInCapacity # ChangeInCapacity —> Increase or decrease the current capacity of the scalable target by the specified value
        Cooldown: 60 # 1 min delay
        MetricAggregationType: Minimum # Valid values are Minimum, Maximum, and Average. If the aggregation type is null, the value is treated as Average. 
        StepAdjustments: 
        - MetricIntervalLowerBound: 0  # bounds are offsets from the alarm threshold (10, set via the SSM parameter); 0 = at the threshold
          MetricIntervalUpperBound: 15 # [Metric Threshold + 15]
          ScalingAdjustment: 1
        - MetricIntervalLowerBound: 15 # [Metric Threshold + 15]
          MetricIntervalUpperBound: 25 # [Metric Threshold + 25]
          ScalingAdjustment: 1
        - MetricIntervalLowerBound: 25 # [Metric Threshold + 25]
          MetricIntervalUpperBound: 35 # [Metric Threshold + 35]
          ScalingAdjustment: 1
        - MetricIntervalLowerBound: 35 # [Metric Threshold + 35]
          MetricIntervalUpperBound: 45 # [Metric Threshold + 45]
          ScalingAdjustment: 1
        - MetricIntervalLowerBound: 45 # [Metric Threshold + 45]
          ScalingAdjustment: 1

  # https://docs.aws.amazon.com/autoscaling/application/userguide/application-auto-scaling-step-scaling-policies.html
  ProdComputeAutoScalingScaleInPolicy:
    Type: 'AWS::ApplicationAutoScaling::ScalingPolicy'
    Condition: CreateProdResources # only create if it is Prod env
    Properties:
      PolicyName: !Sub ${DockerContainerName}-ProdComputeAutoScalingScaleInPolicy
      PolicyType: StepScaling
      ScalingTargetId: !Ref ProdComputeAutoScalingTarget
      ScalableDimension: ecs:service:DesiredCount
      ServiceNamespace: ecs
      StepScalingPolicyConfiguration:
        AdjustmentType: ExactCapacity # Can use PercentChangeInCapacity but then need to come up with configuration including some estimated change in percent
        Cooldown: 60
        MetricAggregationType: Average # Valid values are Minimum, Maximum, and Average. If the aggregation type is null, the value is treated as Average.
        StepAdjustments: 
        - MetricIntervalLowerBound: !Ref AWS::NoValue
          MetricIntervalUpperBound: 0
          ScalingAdjustment: 1

  # https://docs.aws.amazon.com/autoscaling/application/userguide/application-auto-scaling-step-scaling-policies.html
  ProdComputeAutoScalingScaleOutPolicy:
    Type: 'AWS::ApplicationAutoScaling::ScalingPolicy'
    Condition: CreateProdResources # only create if it is Prod env
    Properties:
      PolicyName: !Sub ${DockerContainerName}-ProdComputeAutoScalingScaleOutPolicy
      PolicyType: StepScaling
      ScalingTargetId: !Ref ProdComputeAutoScalingTarget
      ScalableDimension: ecs:service:DesiredCount
      ServiceNamespace: ecs
      StepScalingPolicyConfiguration:
        AdjustmentType: ChangeInCapacity # ChangeInCapacity —> Increase or decrease the current capacity of the scalable target by the specified value
        Cooldown: 60 # 1 min delay
        MetricAggregationType: Minimum # Valid values are Minimum, Maximum, and Average. If the aggregation type is null, the value is treated as Average.
        StepAdjustments: 
        - MetricIntervalLowerBound: 0  # bounds are offsets from the alarm threshold (10, set via the SSM parameter); 0 = at the threshold
          MetricIntervalUpperBound: 15 # [Metric Threshold + 15]
          ScalingAdjustment: 1
        - MetricIntervalLowerBound: 15 # [Metric Threshold + 15]
          MetricIntervalUpperBound: 25 # [Metric Threshold + 25]
          ScalingAdjustment: 1
        - MetricIntervalLowerBound: 25 # [Metric Threshold + 25]
          MetricIntervalUpperBound: 35 # [Metric Threshold + 35]
          ScalingAdjustment: 1
        - MetricIntervalLowerBound: 35 # [Metric Threshold + 35]
          MetricIntervalUpperBound: 45 # [Metric Threshold + 45]
          ScalingAdjustment: 1
        - MetricIntervalLowerBound: 45 # [Metric Threshold + 45]
          ScalingAdjustment: 1

# ######### https://docs.aws.amazon.com/AmazonECS/latest/developerguide/cloudwatch-metrics.html #########
#                                       (Total CPU units used by tasks in service) x 100
# Service CPU utilization =  ----------------------------------------------------------------------------
#                            (Total CPU units specified in task definition) x (number of tasks in service)

  NonProdCPUNoComputeAlarm:
    Type: AWS::CloudWatch::Alarm
    Condition: CreateNonProdResources # only create alarm if it is NonProd env
    Properties:
      AlarmName: !Sub ${DockerContainerName}-NonProdCPUNoComputeAlarm
      AlarmDescription: Alarm when the service's CPU utilization is at or below the specified threshold!
      Namespace: AWS/ECS # AWS::CloudWatch::Alarm.Period >= 60 for metrics in the AWS/ namespace
      MetricName: CPUUtilization
      Dimensions:
        - Name: ServiceName
          Value:
            Fn::GetAtt:
            - NonProdComputeService
            - Name
        - Name: ClusterName
          Value:
            Ref: ComputeCluster
      Statistic: Average # Not using Sum since the metric is CPUUtilization
      Period: 60  # 60 seconds ( Period must be 10, 30 or a multiple of 60 but 10 and 30 can not be used with namespaces with the following prefix: AWS/ )
      EvaluationPeriods: !Ref CpuUtilizationNoComputeOrScaleInAlarmEvaluationPeriodsSSM # 3 by default (from SSM): with no task running, CPU usually starts at 0, so the first period can breach before a task has done any work; the other 2 are extra margin. Change it to 2 if needed, but not 1
      Threshold: !Ref CpuUtilizationScaleInAlarmThresholdSSM
      ComparisonOperator: LessThanOrEqualToThreshold
      AlarmActions:
        - Ref: NonProdNoComputeAutoScalingPolicy

  NonProdInitialQueueDepthAlarm:
    Type: AWS::CloudWatch::Alarm
    Condition: CreateNonProdResources # only create alarm if it is NonProd env
    Properties:
      AlarmName: !Sub ${DockerContainerName}-NonProdInitialQueueDepthAlarm
      AlarmDescription: Alarm if queue depth grows beyond specified threshold!
      Namespace: AWS/SQS # AWS::CloudWatch::Alarm.Period >= 60 for metrics in the AWS/ namespace
      MetricName: ApproximateNumberOfMessagesVisible
      Dimensions:
        - Name: QueueName
          Value : !GetAtt SQSQueue.QueueName
      Statistic: Sum
      Period: 60 # 60 seconds ( Period must be 10, 30 or a multiple of 60 but 10 and 30 can not be used with namespaces with the following prefix: AWS/ )
      EvaluationPeriods: 1
      Threshold: 1 # Threshold is 1 for initial depth
      ComparisonOperator: GreaterThanOrEqualToThreshold
      AlarmActions:
        - Ref: NonProdInitialComputeAutoScalingPolicy

  NonProdQueueDepthScaleOutAlarm:
    Type: AWS::CloudWatch::Alarm
    Condition: CreateNonProdResources # only create alarm if it is NonProd env
    Properties:
      AlarmName: !Sub ${DockerContainerName}-NonProdQueueDepthScaleOutAlarm
      AlarmDescription: Alarm if queue depth grows beyond specified threshold!
      Namespace: AWS/SQS # AWS::CloudWatch::Alarm.Period >= 60 for metrics in the AWS/ namespace
      MetricName: ApproximateNumberOfMessagesVisible
      Dimensions:
        - Name: QueueName
          Value : !GetAtt SQSQueue.QueueName
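      Statistic: Sum # this metric is a gauge reported about once a minute, so Sum over the 120-second period below adds the samples together; Maximum or Average tracks the actual depth more closely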
      Period: 120 # 120 seconds ( Period must be 10, 30 or a multiple of 60 but 10 and 30 can not be used with namespaces with the following prefix: AWS/ )
      EvaluationPeriods: 1
      Threshold: !Ref QueueDepthScaleOutAlarmThresholdSSM # change this as needed
      ComparisonOperator: GreaterThanOrEqualToThreshold
      AlarmActions:
        - Ref: NonProdComputeAutoScalingScaleOutPolicy

  ProdCPUScaleInAlarm:
    Type: AWS::CloudWatch::Alarm
    Condition: CreateProdResources # only create alarm if it is Prod env
    Properties:
      AlarmName: !Sub ${DockerContainerName}-ProdCPUScaleInAlarm
      AlarmDescription: Alarm when the service's CPU utilization is at or below the specified threshold!
      Namespace: AWS/ECS # AWS::CloudWatch::Alarm.Period >= 60 for metrics in the AWS/ namespace
      MetricName: CPUUtilization
      Dimensions:
        - Name: ServiceName
          Value:
            Fn::GetAtt:
            - ProdComputeService
            - Name
        - Name: ClusterName
          Value:
            Ref: ComputeCluster
      Statistic: Average # Not using Sum since the metric is CPUUtilization
      Period: 60  # 60 seconds ( Period must be 10, 30 or a multiple of 60 but 10 and 30 can not be used with namespaces with the following prefix: AWS/ )
      EvaluationPeriods: !Ref CpuUtilizationNoComputeOrScaleInAlarmEvaluationPeriodsSSM # 3 by default (from SSM) as extra margin; change it to 1 if needed
      Threshold: !Ref CpuUtilizationScaleInAlarmThresholdSSM 
      ComparisonOperator: LessThanOrEqualToThreshold
      AlarmActions:
        - Ref: ProdComputeAutoScalingScaleInPolicy

  ProdQueueDepthScaleOutAlarm:
    Type: AWS::CloudWatch::Alarm
    Condition: CreateProdResources # only create alarm if it is Prod env
    Properties:
      AlarmName: !Sub ${DockerContainerName}-ProdQueueDepthScaleOutAlarm
      AlarmDescription: Alarm if queue depth grows beyond specified threshold!
      Namespace: AWS/SQS # AWS::CloudWatch::Alarm.Period >= 60 for metrics in the AWS/ namespace
      MetricName: ApproximateNumberOfMessagesVisible
      Dimensions:
        - Name: QueueName
          Value : !GetAtt SQSQueue.QueueName
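      Statistic: Sum # this metric is a gauge reported about once a minute, so Sum over the 120-second period below adds the samples together; Maximum or Average tracks the actual depth more closely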
      Period: 120 # 120 seconds ( Period must be 10, 30 or a multiple of 60 but 10 and 30 can not be used with namespaces with the following prefix: AWS/ )
      EvaluationPeriods: 1
      Threshold: !Ref QueueDepthScaleOutAlarmThresholdSSM # change this as needed
      ComparisonOperator: GreaterThanOrEqualToThreshold
      AlarmActions:
        - Ref: ProdComputeAutoScalingScaleOutPolicy

The template above creates two sets of resources, NonProd and Prod. In lower environments, the DesiredCount of the ECS service is set to zero to save cost. A CloudWatch alarm takes at least one minute to react to a scaling event, and in prod we do not want that delay, so DesiredCount is set to one there.

One more thing to note: scale-out is based on queue depth, but scale-in is based on CPU utilization. The consumer-service can take several minutes to finish processing a message, and I don't want Application Auto Scaling to remove tasks that are still working just because the queue is empty. For this situation, EC2 has instance scale-in protection, which lets you control which queue workers are terminated when your Auto Scaling group scales in. It was not available for Fargate-based ECS clusters, but AWS recently introduced a new ECS feature called task scale-in protection. There is a blog post covering it.
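
As a rough idea of what task scale-in protection looks like from the command line, the sketch below builds an `aws ecs update-task-protection` call for a single task. The function name protect_task_cmd and the cluster name, task ID and 30-minute expiry are hypothetical placeholders, not values from the template above, and the function only prints the command so it can be inspected without touching an AWS account.

```shell
#!/usr/bin/env bash
# Sketch: build the AWS CLI call that turns on scale-in protection for one task.
# Usage: protect_task_cmd <cluster> <task-id-or-arn> <expiry-in-minutes>
protect_task_cmd() {
  local cluster="$1" task="$2" minutes="$3"
  # echo prints the command instead of running it.
  echo aws ecs update-task-protection \
    --cluster "$cluster" --tasks "$task" \
    --protection-enabled --expires-in-minutes "$minutes"
}

# Hypothetical cluster and task identifiers, for illustration only.
protect_task_cmd my-cluster my-task-id 30
```

A worker would typically enable protection when it picks up a message and switch it off again (with --no-protection-enabled) once the message is processed, so scale-in only removes idle tasks.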

Here is the bash script for creating SSM parameters.

#!/usr/bin/env bash

while (($# > 1)); do
  case $1 in
  --profile) PROFILE="$2" ;;
  *) break ;;
  esac
  shift 2
done

# Fail fast if no profile was given; every command below needs it.
: "${PROFILE:?usage: $0 --profile <aws-profile>}"

echo "Deleting Parameters..."
aws ssm delete-parameter --profile "$PROFILE" --name "/$PROFILE/services/consumer-service/AWS_REGION"
aws ssm delete-parameter --profile "$PROFILE" --name "/config/ecs/FARGATE_SCALING_ENV"
aws ssm delete-parameter --profile "$PROFILE" --name "/config/ecs/consumer-service/QUEUE_DEPTH_SCALE_OUT_ALARM_THRESHOLD"
aws ssm delete-parameter --profile "$PROFILE" --name "/config/ecs/consumer-service/CPU_UTILIZATION_SCALE_IN_ALARM_THRESHOLD"
aws ssm delete-parameter --profile "$PROFILE" --name "/config/ecs/consumer-service/CPU_UTILIZATION_NO_COMPUTE_OR_SCALE_IN_ALARM_EVALUATION_PERIODS"
aws ssm delete-parameter --profile "$PROFILE" --name "/config/ecs/consumer-service/AUTO_SCALING_TARGET_MAX_CAPACITY"

echo "Creating parameters..."
aws ssm put-parameter --profile "$PROFILE" --overwrite --cli-input-json '{"Type": "String", "Name": "/'"$PROFILE"'/services/consumer-service/AWS_REGION", "Value": "us-east-1"}'
aws ssm put-parameter --profile "$PROFILE" --overwrite --cli-input-json '{"Type": "String", "Name": "/config/ecs/FARGATE_SCALING_ENV", "Value": "non-prod"}' # valid values are: non-prod or prod
aws ssm put-parameter --profile "$PROFILE" --overwrite --cli-input-json '{"Type": "String", "Name": "/config/ecs/consumer-service/QUEUE_DEPTH_SCALE_OUT_ALARM_THRESHOLD", "Value": "10"}' # if you increase this, also adjust the step scaling bounds
aws ssm put-parameter --profile "$PROFILE" --overwrite --cli-input-json '{"Type": "String", "Name": "/config/ecs/consumer-service/CPU_UTILIZATION_SCALE_IN_ALARM_THRESHOLD", "Value": "2"}' # 2%
aws ssm put-parameter --profile "$PROFILE" --overwrite --cli-input-json '{"Type": "String", "Name": "/config/ecs/consumer-service/CPU_UTILIZATION_NO_COMPUTE_OR_SCALE_IN_ALARM_EVALUATION_PERIODS", "Value": "3"}' # do not set this to 1 in non-prod
aws ssm put-parameter --profile "$PROFILE" --overwrite --cli-input-json '{"Type": "String", "Name": "/config/ecs/consumer-service/AUTO_SCALING_TARGET_MAX_CAPACITY", "Value": "6"}' # set to 1 to effectively disable autoscaling
