Jimmy Dahlqvist

GitHub Self hosted runners on AWS - part 2 - EC2

"GitHub Actions makes it easy to automate all your software workflows, now with world-class CI/CD."

That is how GitHub describes its built-in CI/CD tooling. I must say that I really like it, and it is something GitHub was missing for a long time. Many of the other Git-as-a-service providers, such as GitLab, Bitbucket, and Azure DevOps, have long bundled CI/CD tooling, whereas with GitHub you always needed an outside tool.

This is part two in the series on how to create and set up your own self-hosted runner in AWS.

All code used can be found in my GitHub repo

Part one, short recap

In part one I showed how to run self-hosted runners on Fargate. The conclusion was that it wasn't a good match. If you missed it, you can find it here

Part two, EC2

Instead, in this part I will show how to use EC2 and Auto Scaling Groups to run and host the runners. As always, we start by setting up a VPC. I run everything in a simple VPC with two public subnets. The CloudFormation template used is located on GitHub

Automatically add and register a runner

Create an Auto Scaling Group

We start by creating an Auto Scaling Group that can add and remove instances as we see fit.

  AutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      AutoScalingGroupName: github-runners-asg
      Cooldown: 300
      DesiredCapacity: 1
      MaxSize: 5
      MinSize: 0
      HealthCheckGracePeriod: 300
      HealthCheckType: EC2
      LaunchConfigurationName: !Ref LaunchConfiguration
      VPCZoneIdentifier:
        - Fn::ImportValue: !Sub ${VpcStackName}:PublicSubnet1
        - Fn::ImportValue: !Sub ${VpcStackName}:PublicSubnet2

We set the desired capacity to one instance, with the possibility to scale up to five instances.

Create a Launch Configuration

For an EC2 instance to install and set up everything when started from an Auto Scaling Group, we create a Launch Configuration. We use the User Data part of the Launch Configuration to install and register the runner. This way, every time a new instance is created by the Auto Scaling Group (ASG), the instance registers itself with GitHub.

  LaunchConfiguration:
    Type: AWS::AutoScaling::LaunchConfiguration
    Properties:
      ImageId: !Ref EC2ImageId
      InstanceType: t3.micro
      IamInstanceProfile: !GetAtt EC2InstanceProfile.Arn
      KeyName: !Ref SSHKeyName
      SecurityGroups:
        - !Ref SecurityGroup
      UserData:
        Fn::Base64:
          Fn::Sub: |
            #!/bin/bash -xe
            # Install Docker, git and jq (jq parses the GitHub API response below)
            yum update -y
            yum install docker -y
            yum install git -y
            yum install jq -y
            sudo usermod -a -G docker ec2-user
            sudo systemctl start docker
            sudo systemctl enable docker
            # User data runs as root, so the runner must be allowed to run as root
            export RUNNER_ALLOW_RUNASROOT=true
            # Download and unpack the runner into /actions-runner
            mkdir /actions-runner
            cd /actions-runner
            curl -O -L https://github.com/actions/runner/releases/download/v2.262.1/actions-runner-linux-x64-2.262.1.tar.gz
            tar xzf ./actions-runner-linux-x64-2.262.1.tar.gz
            # Exchange the personal access token (PAT) for a short-lived registration token
            PAT=<Super Secret PAT>
            token=$(curl -s -XPOST \
                -H "authorization: token $PAT" \
                https://api.github.com/repos/<GitHub_User>/<GitHub_Repo>/actions/runners/registration-token |\
                jq -r .token)
            sudo chown ec2-user -R /actions-runner
            # Register the runner, then install and start it as a service
            ./config.sh --url https://github.com/<GitHub_User>/<GitHub_Repo> --token $token --name "my-runner-$(hostname)" --work _work
            sudo ./svc.sh install
            sudo ./svc.sh start
            # Give ec2-user ownership so the removal script can run as that user
            sudo chown ec2-user -R /actions-runner

Once again we need to set RUNNER_ALLOW_RUNASROOT to true, since the user data script runs as root. When the instance has started and registered with GitHub, it is ready to start serving build jobs.
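
For anyone who prefers to script the token exchange outside of bash, the same request can be sketched in Python. This is only a minimal sketch using the standard library. The function names are made up for illustration and are not part of the template; the endpoint is the one the curl call in the user data uses.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def build_registration_request(owner, repo, pat):
    """Build the POST request that asks GitHub for a runner registration token."""
    url = f"{API_ROOT}/repos/{owner}/{repo}/actions/runners/registration-token"
    return urllib.request.Request(
        url,
        method="POST",
        headers={"Authorization": f"token {pat}"},
    )


def fetch_registration_token(owner, repo, pat):
    """Send the request and return the short-lived token from the JSON reply."""
    request = build_registration_request(owner, repo, pat)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["token"]
```

Splitting the request construction from the network call makes it possible to inspect the URL and headers without talking to GitHub.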

Automatically remove a runner

Automatically registering and starting to serve jobs is just one part of the chain. If an instance is removed by the Auto Scaling Group, we also want it to remove itself from the pool of runners in GitHub. To do that we tie a couple of features and services together.

Lifecycle Hooks

To get notified when an instance is about to be removed, and to be able to pause the termination process, we use Lifecycle Hooks in the Auto Scaling Group. This way the instance goes into a wait state (Terminating:Wait), giving us the possibility to run scripts that remove it as a runner before it terminates.

ASG Hooks

  TerminateLifecycleHook:
    Type: AWS::AutoScaling::LifecycleHook
    Properties:
      AutoScalingGroupName: !Ref AutoScalingGroup
      LifecycleTransition: autoscaling:EC2_INSTANCE_TERMINATING

AWS Systems Manager

When an instance enters the Terminating:Wait state we want to run a script on the instance. To do that we use an AWS Systems Manager Document and send a command to the SSM agent to run the Document.

  RemoveDocument:
    Type: AWS::SSM::Document
    Properties:
      DocumentType: Command
      Tags:
        - Key: Name
          Value: github-actions-install-register-runner
      Content:
        schemaVersion: "2.2"
        description: Command Document de-register GitHub Actions Runner
        mainSteps:
          - action: "aws:runShellScript"
            name: "deregister"
            inputs:
              runCommand:
                - "cd /actions-runner"
                - "sudo ./svc.sh stop"
                - "sudo ./svc.sh uninstall"
                - "PAT=<Super Secret PAT>"
                - 'token=$(curl -s -XPOST -H "authorization: token $PAT" https://api.github.com/repos/<GitHub_User>/<GitHub_Repo>/actions/runners/remove-token | jq -r .token)'
                - 'su ec2-user -c "./config.sh remove --token $token"'

The Document runs a shell script that stops and uninstalls the GitHub runner service, fetches a remove token, and removes the runner from GitHub.

The Document is run as root, and since the remove command ignores the RUNNER_ALLOW_RUNASROOT flag, we make sure to run it as ec2-user instead. This is really important; the removal will not work otherwise. That is why we give ec2-user full access to the runner's folder when we install and start the service in the Launch Configuration.

EventBridge

To get notified when a lifecycle event happens it is possible to use SNS. However, I decided to use EventBridge instead for several reasons, primarily because EventBridge supports more targets.

So I set up an Events::Rule to detect the change and trigger a Lambda function that will run the SSM Document.

  TerminatingRule:
    Type: AWS::Events::Rule
    Properties:
      EventPattern: !Sub |
        {
          "source": [
            "aws.autoscaling"
          ],
          "detail-type": [
            "EC2 Instance-terminate Lifecycle Action"
          ]
        }
      Targets:
        - Arn: !GetAtt LifeCycleHookTerminatingFunction.Arn
          Id: target
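
To make the pattern more concrete, here is a rough Python sketch of what the rule is looking for. The matcher is a deliberately simplified stand-in for EventBridge's own matching (exact matches on top-level string fields only), and every identifier in the sample event is a made-up placeholder. The keys under detail are the ones the Lambda function later reads.

```python
def matches(pattern, event):
    """Return True when every field in the pattern lists the event's value.

    Simplified: handles only top-level string fields with exact matching.
    """
    return all(event.get(field) in allowed for field, allowed in pattern.items())


# The pattern from the rule above
PATTERN = {
    "source": ["aws.autoscaling"],
    "detail-type": ["EC2 Instance-terminate Lifecycle Action"],
}

# Illustrative event; the hook name and instance id are placeholders
SAMPLE_EVENT = {
    "source": "aws.autoscaling",
    "detail-type": "EC2 Instance-terminate Lifecycle Action",
    "detail": {
        "LifecycleHookName": "example-terminate-hook",
        "AutoScalingGroupName": "github-runners-asg",
        "EC2InstanceId": "i-0123456789abcdef0",
        "LifecycleTransition": "autoscaling:EC2_INSTANCE_TERMINATING",
    },
}

print(matches(PATTERN, SAMPLE_EVENT))  # prints True
```

Note that the pattern does not name a specific Auto Scaling Group, so it matches terminate lifecycle actions from any group in the account and region.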

Lambda

When the Lifecycle hook is detected by EventBridge a Lambda function will be triggered. The Lambda function will call SSM RunCommand API to run the SSM Document. As discussed earlier the Document will run a script that will call GitHub and remove the runner.

When the Document is done running we need to notify the ASG to continue terminating the instance.

  LifeCycleHookTerminatingFunction:
    Type: AWS::Serverless::Function
    Properties:
      FunctionName: github-runners-asg-lifecycle-hook-terminate
      Runtime: python3.6
      MemorySize: 256
      Timeout: 30
      CodeUri: ./lambdas
      Handler: terminate.handler
      Role: !GetAtt LifeCycleHookTerminatingFunctionRole.Arn
      Environment:
        Variables:
          SSM_DOCUMENT_NAME: !Ref RemoveDocument

As mentioned, the function needs to call the SSM RunCommand API and then the ASG API to continue the termination.

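import logging
import os

import boto3

logger = logging.getLogger()
logger.setLevel(logging.INFO)

# The excerpt does not show these definitions, so the values are assumptions.
# The first three are keys in the 'detail' section of the lifecycle event.
LIFECYCLE_KEY = 'LifecycleHookName'
ASG_KEY = 'AutoScalingGroupName'
EC2_KEY = 'EC2InstanceId'
# Environment variable set on the function in the template above
SSM_DOCUMENT_KEY = 'SSM_DOCUMENT_NAME'


def handler(event, context):
    message = event['detail']
    if LIFECYCLE_KEY in message and ASG_KEY in message and EC2_KEY in message:
        life_cycle_hook = message[LIFECYCLE_KEY]
        auto_scaling_group = message[ASG_KEY]
        instance_id = message[EC2_KEY]
        ssm_document = os.environ[SSM_DOCUMENT_KEY]
        success = run_ssm_command(ssm_document, instance_id)
        # CONTINUE lets the termination proceed, ABANDON is sent on failure
        result = 'CONTINUE' if success else 'ABANDON'
        notify_lifecycle(life_cycle_hook, auto_scaling_group,
                         instance_id, result)
    return {}


def run_ssm_command(ssm_document, instance_id):
    ssm_client = boto3.client('ssm')
    try:
        # send_command returns as soon as the command is accepted;
        # it does not wait for the script to finish on the instance
        ssm_client.send_command(DocumentName=ssm_document,
                                InstanceIds=[str(instance_id)],
                                Comment='Remove GitHub Runner',
                                TimeoutSeconds=1200)
    except Exception as e:
        logger.error("SSM command could not be sent: %s", str(e))
        return False
    return True


def notify_lifecycle(life_cycle_hook, auto_scaling_group, instance_id, result):
    asg_client = boto3.client('autoscaling')
    try:
        asg_client.complete_lifecycle_action(
            LifecycleHookName=life_cycle_hook,
            AutoScalingGroupName=auto_scaling_group,
            LifecycleActionResult=result,
            InstanceId=instance_id
        )
    except Exception as e:
        logger.error(
            "Lifecycle hook notification could not be executed: %s", str(e))
        raise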

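
One thing to be aware of is that send_command only dispatches the command, so the function above notifies the ASG right after the command has been accepted. If you want to hold the termination until the script has actually finished, one option is to poll the invocation status first. The helper below is a hypothetical sketch, not code from the repo. It assumes you pass in a boto3 SSM client and the CommandId found in the send_command response, and it does not handle the case where the invocation is not yet visible right after dispatch.

```python
import time

# Statuses after which an SSM command invocation no longer changes
TERMINAL_STATUSES = {"Success", "Cancelled", "TimedOut", "Failed"}


def wait_for_command(ssm_client, command_id, instance_id,
                     poll_seconds=5, max_polls=60, sleep=time.sleep):
    """Poll get_command_invocation until the command finishes.

    Returns True only if the final status is 'Success'.
    """
    for _ in range(max_polls):
        invocation = ssm_client.get_command_invocation(
            CommandId=command_id, InstanceId=instance_id)
        status = invocation["Status"]
        if status in TERMINAL_STATUSES:
            return status == "Success"
        sleep(poll_seconds)
    return False  # gave up before the command reached a terminal status
```

With something like this in place, the result of wait_for_command(ssm_client, response['Command']['CommandId'], instance_id) could decide between CONTINUE and ABANDON. Keep in mind that the function above has a 30 second timeout, so the timeout and the polling budget would need to be tuned together.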

Conclusion

Running the GitHub self-hosted runners on EC2 instances in an Auto Scaling Group, with Lifecycle Hooks for removal, works really well. We can now add and remove instances in the ASG and have them register and deregister themselves.

But I'm still not happy. What if we get a spike in the number of jobs and the queue grows? Even though we run in an ASG, we still don't scale automatically.

Time to throw auto scaling into the pot... Stay tuned for part 3.

Code

All code in this blog series can be found on GitHub

Top comments (3)

Akshay Jain

Any updates on part 3?

Axel Fontaine

We simplified all this for our users at sprinters.sh . No need for auto-scaling or idle capacity. Each job automatically gets its own fast-booting ephemeral runner on EC2 spot without having to maintain any orchestration infrastructure.

andrewdibiasio6

I would love to see how you auto scale this!