Table of Contents
- GitHub checks API
- Using kubernetes jobs to schedule actions runners
- Running our script as a kubernetes cronjob
- Fixing a bug in our logic
- Conclusion
In the last post we got a simple actions runner orchestrator running with bash and cron. We also noted a few issues with that version. In this post we will fix up the following issues:
- Instead of launching a runner per commit, we will instead launch a runner per check request.
- Instead of running local docker containers, we will run kubernetes jobs.
- Instead of just running locally with cron, we will create a kubernetes CronJob.
Let's do it!
GitHub checks API
When a check run is requested, the GitHub checks API will reflect this. For example:
$ curl -s \
    -H "accept: application/vnd.github.antiope-preview+json" \
    -H "authorization: token ${PAT}" \
    https://api.github.com/repos/wayofthepie/gh-app-test/commits/e13119b0/check-runs
{
    "total_count": 1,
    "check_runs": [
        {
            "id": 433544203,
            "node_id": "MDg6Q2hlY2tSdW40MzM1NDQyMDM=",
            "head_sha": "e13119b07d81e4c587882b2f7c9d7a730810f709",
            "external_id": "ca395085-040a-526b-2ce8-bdc85f692774",
            "url": "https://api.github.com/repos/wayofthepie/gh-app-test/check-runs/433544203",
            "html_url": "https://github.com/wayofthepie/gh-app-test/runs/433544203",
            "details_url": "https://help.github.com/en/actions",
            "status": "queued",
            "conclusion": null,
            "started_at": "2020-02-08T14:40:27Z",
            "completed_at": null,
            ...
This will return the status of all the check runs for the given commit (in the URL above, e13119b0 is the short ref for the commit). As you can see, the status of the first check run is queued, which in this case means it is waiting for a runner to execute on.
Using this information will also allow our script to be completely stateless. In the previous post it had to keep track of the last commit in a file; with the checks API we no longer need this.
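As a quick sanity check, we can filter for the runs that are still waiting. Here is a rough one-off sketch, where PAT, OWNER, REPO and COMMIT are placeholders for your own values:
# list the ids of check runs that are still queued for a commit
curl -s \
    -H "accept: application/vnd.github.antiope-preview+json" \
    -H "authorization: token ${PAT}" \
    "https://api.github.com/repos/${OWNER}/${REPO}/commits/${COMMIT}/check-runs" \
    | jq -r '.check_runs[] | select(.status == "queued") | .id'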
Making use of this new information
Now we can change the logic in our orchestration script as follows:
- Get the latest commit.
- Get all requested check runs for that commit.
- For each check run that is still queued, launch an actions runner.
Here is the updated script:
#!/usr/bin/env bash

PAT=$1
OWNER=$2
REPO=$3

# make sure we have values for all our arguments
[ -z ${PAT} ] || [ -z ${OWNER} ] || [ -z $REPO ] && {
    echo "Incorrect usage, example: ./orc.sh personal-access-token owner some-repo"
    exit 1
}

# get the latest commit
latest_commit=$(curl -s -H "authorization: token ${PAT}" \
    https://api.github.com/repos/${OWNER}/${REPO}/commits |\
    jq -r .[0].sha)

# for each check run requested for this commit, get the "status"
# field and assign to the "check_status" variable
for check_status in $(curl -s \
    -H "accept: application/vnd.github.antiope-preview+json" \
    -H "authorization: token ${PAT}" \
    https://api.github.com/repos/${OWNER}/${REPO}/commits/${latest_commit}/check-runs \
    | jq -r '.check_runs[] | "\(.status)"'); do

    # if "check_status" is queued launch an action runner
    if [ "${check_status}" == "queued" ]; then
        echo "Launching actions runner ..."
        docker run -d --rm actions-image \
            ${OWNER} \
            ${REPO} \
            ${PAT} \
            $(uuidgen)
    fi
done
The code up to this point can be found here.
Add a new commit to the repository you have been running the actions against and run ./orc.sh ${PAT} ${OWNER} ${REPO}; it should start a container and run the build.
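If you want to watch it happen, something like the following should work, assuming the image is tagged actions-image as in the docker run command above:
# the runner container should show up here while the build is in flight
docker ps --filter ancestor=actions-image
# follow its logs using the container id printed by docker ps
docker logs -f <container-id>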
Using kubernetes jobs to schedule actions runners
If we move to kubernetes instead of our local docker daemon we can scale out much more easily. So let's update the orchestrator to launch the actions runners as kubernetes jobs instead of plain docker containers.
First we need a cluster. There are many ways to create one; I use Google Cloud, so that is where I'll create mine. Note that this has a cost, see https://kubernetes.io/docs/setup/ for examples of local cluster setups.
To create a cluster on google cloud:
$ gcloud container clusters create actions-spawner --zone europe-west2-c --num-nodes 1
WARNING: Currently VPC-native is not ...
WARNING: Newly created clusters ...
...
Creating cluster actions-spawner in europe-west2-c... Cluster is being health-checked (master is healthy)...done.
Created [https://container.googleapis.com/v1/projects/monthly-hacking/zones/europe-west2-c/clusters/actions-spawner].
To inspect the contents of your cluster, go to: https://console.cloud.google.com/kubernetes/workload_/gcloud/europe-west2-c/actions-spawner?project=monthly-hacking
kubeconfig entry generated for actions-spawner.
NAME LOCATION MASTER_VERSION MASTER_IP MACHINE_TYPE NODE_VERSION NUM_NODES STATUS
actions-spawner europe-west2-c 1.13.11-gke.23 35.189.78.16 n1-standard-1 1.13.11-gke.23 1 RUNNING
This will create a cluster with a single node, which is all we need for testing. Make sure you have kubectl installed; auth should be set up for kubectl automatically. To test, let's list the nodes:
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
gke-actions-spawner-default-pool-f1380f72-3wm5 Ready <none> 4m47s v1.13.11-gke.23
Looks good! Now, let's update our orchestration script:
#!/usr/bin/env bash

PAT=$1
OWNER=$2
REPO=$3

# make sure we have values for all our arguments
[ -z ${PAT} ] || [ -z ${OWNER} ] || [ -z $REPO ] && {
    echo "Incorrect usage, example: ./orc.sh personal-access-token owner some-repo"
    exit 1
}

# get the latest commit
latest_commit=$(curl -s -H "authorization: token ${PAT}" \
    https://api.github.com/repos/${OWNER}/${REPO}/commits |\
    jq -r .[0].sha)

# for each check run requested for this commit, get the "status"
# field and assign to the "check_status" variable
for check_status in $(curl -s \
    -H "accept: application/vnd.github.antiope-preview+json" \
    -H "authorization: token ${PAT}" \
    https://api.github.com/repos/${OWNER}/${REPO}/commits/${latest_commit}/check-runs \
    | jq -r '.check_runs[] | "\(.status)"'); do

    # if "check_status" is queued launch an action runner
    if [ "${check_status}" == "queued" ]; then
        echo "Found check run request with status ${check_status}, launching job ..."
        cat job.yaml \
            | sed -r "s/\{NAME\}/$(uuidgen)/g; s/\{OWNER\}/${OWNER}/; s/\{REPO\}/${REPO}/; s/\{TOKEN\}/${PAT}/" \
            | kubectl apply -f -
    else
        echo "Found check run request with status '${check_status}', nothing to do ..."
    fi
done
And create job.yaml, the specification for our kubernetes job:
apiVersion: batch/v1
kind: Job
metadata:
  name: {NAME}
spec:
  template:
    spec:
      containers:
      - name: {NAME}
        image: wayofthepie/actions-image
        args: ["{OWNER}", "{REPO}", "{TOKEN}"]
      restartPolicy: Never
  backoffLimit: 4
⚠️ WARNING ⚠️
The token here should be stored as a kubernetes secret. Using the token directly as I have above is not good practice. I will fix this later in this post.
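Before wiring anything else up, it can be useful to preview what the templating produces. A rough sketch, using the same placeholders as job.yaml:
# render job.yaml with throwaway values and eyeball the result before applying it
cat job.yaml \
    | sed -r "s/\{NAME\}/$(uuidgen)/g; s/\{OWNER\}/${OWNER}/; s/\{REPO\}/${REPO}/; s/\{TOKEN\}/${PAT}/"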
Let's test it:
$ ./orc.sh ${PAT} ${OWNER} ${REPO}
Found check run request with status 'completed', nothing to do ...
Great, it works when there are no runs requested. Commit to the repo you are testing against and run again:
$ ./orc.sh ${PAT} ${OWNER} ${REPO}
Found check run request with status queued, launching job ...
job.batch/990a0d3d-bb98-419e-abc5-ca4fa48ca328 created
Looks like it worked. Let's see what's running:
$ kubectl get jobs
NAME COMPLETIONS DURATION AGE
990a0d3d-bb98-419e-abc5-ca4fa48ca328 0/1 5s 5s
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
990a0d3d-bb98-419e-abc5-ca4fa48ca328-wxn5c 1/1 Running 0 8s
$ kubectl logs 990a0d3d-bb98-419e-abc5-ca4fa48ca328-wxn5c -f
Unrecognized command-line input arguments: 'name'. For usage refer to: .\config.cmd --help or ./config.sh --help
--------------------------------------------------------------------------------
| ____ _ _ _ _ _ _ _ _ |
| / ___(_) |_| | | |_ _| |__ / \ ___| |_(_) ___ _ __ ___ |
| | | _| | __| |_| | | | | '_ \ / _ \ / __| __| |/ _ \| '_ \/ __| |
| | |_| | | |_| _ | |_| | |_) | / ___ \ (__| |_| | (_) | | | \__ \ |
| \____|_|\__|_| |_|\__,_|_.__/ /_/ \_\___|\__|_|\___/|_| |_|___/ |
| |
| Self-hosted runner registration |
| |
--------------------------------------------------------------------------------
# Authentication
√ Connected to GitHub
# Runner Registration
Enter the name of runner: [press Enter for 990a0d3d-bb98-419e-abc5-ca4fa48ca328-wxn5c]
√ Runner successfully added
√ Runner connection is good
# Runner settings
√ Settings Saved.
√ Connected to GitHub
2020-02-08 16:35:49Z: Listening for Jobs
2020-02-08 16:35:53Z: Running job: build
Great! It kicked off a build. However, notice the warning at the start:
Unrecognized command-line input arguments: 'name'. For usage refer to: .\config.cmd --help or ./config.sh --help
Something is wrong... A quick look through orc.sh and job.yaml highlights the issue: we are missing the fourth argument to the wayofthepie/actions-image image! This sets the name of the actions runner. Let's fix it up:
apiVersion: batch/v1
kind: Job
metadata:
  name: {NAME}
spec:
  template:
    spec:
      containers:
      - name: {NAME}
        image: wayofthepie/actions-image
        # here we add the name argument
        args: ["{OWNER}", "{REPO}", "{TOKEN}", "{NAME}"]
      restartPolicy: Never
  backoffLimit: 4
Let's commit to the test repo and run again:
$ ./orc.sh ${PAT} ${OWNER} ${REPO}
Found check run request with status queued, launching job ...
job.batch/7abcb7a1-b1bb-4641-88af-fc4562e29bb7 created
$ kubectl get jobs
NAME COMPLETIONS DURATION AGE
7abcb7a1-b1bb-4641-88af-fc4562e29bb7 0/1 4s 4s
990a0d3d-bb98-419e-abc5-ca4fa48ca328 1/1 46s 12m
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
7abcb7a1-b1bb-4641-88af-fc4562e29bb7-q5jdd 1/1 Running 0 13s
990a0d3d-bb98-419e-abc5-ca4fa48ca328-wxn5c 0/1 Completed 0 12m
$ kubectl logs 7abcb7a1-b1bb-4641-88af-fc4562e29bb7-q5jdd
--------------------------------------------------------------------------------
| ____ _ _ _ _ _ _ _ _ |
| / ___(_) |_| | | |_ _| |__ / \ ___| |_(_) ___ _ __ ___ |
| | | _| | __| |_| | | | | '_ \ / _ \ / __| __| |/ _ \| '_ \/ __| |
| | |_| | | |_| _ | |_| | |_) | / ___ \ (__| |_| | (_) | | | \__ \ |
| \____|_|\__|_| |_|\__,_|_.__/ /_/ \_\___|\__|_|\___/|_| |_|___/ |
| |
| Self-hosted runner registration |
| |
--------------------------------------------------------------------------------
# Authentication
√ Connected to GitHub
# Runner Registration
√ Runner successfully added
√ Runner connection is good
# Runner settings
√ Settings Saved.
√ Connected to GitHub
2020-02-08 16:48:30Z: Listening for Jobs
2020-02-08 16:48:34Z: Running job: build
2020-02-08 16:48:52Z: Job build completed with result: Succeeded
# Runner removal
√ Runner removed successfully
√ Removed .credentials
√ Removed .runner
Great! No warnings, all working as expected. The code up to this point can be seen here.
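One bit of housekeeping: completed jobs and their pods stick around until deleted, so you may want to clean up the earlier job at some point, for example:
# remove a finished job (this also removes its pod)
kubectl delete job 990a0d3d-bb98-419e-abc5-ca4fa48ca328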
Storing our token properly
Right now our personal access token gets stored in the definition of our job! If we retrieve our job we can see it:
$ kubectl get jobs 990a0d3d-bb98-419e-abc5-ca4fa48ca328 -o json
{
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {
        ...
    },
    "spec": {
        ...
        "template": {
            ...
            "spec": {
                "containers": [
                    {
                        "args": [
                            "wayofthepie",
                            "gh-app-test",
                            "THE TOKEN IS IN HERE!!!"
                        ],
                        "image": "wayofthepie/actions-image",
                        "imagePullPolicy": "Always",
                        "name": "990a0d3d-bb98-419e-abc5-ca4fa48ca328",
                        ...
                    }
                ],
                ...
}
This is not good! We should be storing this as a kubernetes secret. Let's do that. First, create a secret.yaml defining our secret:
apiVersion: v1
kind: Secret
metadata:
  name: github-token
type: Opaque
stringData:
  token: {TOKEN}
To create the secret:
$ cat secret.yaml \
    | sed -r "s/\{TOKEN\}/${YOUR_PAT}/" \
    | kubectl apply -f -
secret/github-token created
$ kubectl get secrets
NAME TYPE DATA AGE
...
github-token Opaque 1 30s
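To sanity check what was stored, we can read the secret back and decode it, since kubernetes stores the value base64 encoded:
# print the decoded token - be careful where you run this!
kubectl get secret github-token -o jsonpath='{.data.token}' | base64 --decode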
With that created, let's update our job spec in job.yaml:
apiVersion: batch/v1
kind: Job
metadata:
  name: {NAME}
spec:
  template:
    spec:
      containers:
      - name: {NAME}
        image: wayofthepie/actions-image
        args: ["{OWNER}", "{REPO}", "$(GITHUB_TOKEN)", "{NAME}"]
        env:
        - name: GITHUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: github-token
              key: token
      restartPolicy: Never
  backoffLimit: 4
See these docs for more on reading secret data into env vars.
Also note the syntax used to reference GITHUB_TOKEN: it uses $() and not ${}, see here.
Re-run and everything should work:
$ ./orc.sh ${PAT} ${OWNER} ${REPO}
Found check run request with status queued, launching job ...
job.batch/eb1e314d-594b-4253-ae8a-74c797a2cd76 created
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
eb1e314d-594b-4253-ae8a-74c797a2cd76-x47lr 0/1 ContainerCreating 0 3s
$ kubectl logs -f eb1e314d-594b-4253-ae8a-74c797a2cd76-x47lr
...
2020-02-08 17:41:08Z: Listening for Jobs
2020-02-08 17:41:12Z: Running job: build
2020-02-08 17:41:30Z: Job build completed with result: Succeeded
# Runner removal
√ Runner removed successfully
√ Removed .credentials
√ Removed .runner
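This time the token is no longer baked into the job definition; the stored args should contain the $(GITHUB_TOKEN) placeholder rather than the raw token. A quick way to double check, using the job name from the run above:
# the args in the stored spec should show $(GITHUB_TOKEN), not the token itself
kubectl get job eb1e314d-594b-4253-ae8a-74c797a2cd76 \
    -o jsonpath='{.spec.template.spec.containers[0].args}'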
Great! We should also clean up orc.sh:
#!/usr/bin/env bash

PAT=$1
OWNER=$2
REPO=$3

# make sure we have values for all our arguments
[ -z ${PAT} ] || [ -z ${OWNER} ] || [ -z $REPO ] && {
    echo "Incorrect usage, example: ./orc.sh personal-access-token owner some-repo"
    exit 1
}

# get the latest commit
latest_commit=$(curl -s -H "authorization: token ${PAT}" \
    https://api.github.com/repos/${OWNER}/${REPO}/commits |\
    jq -r .[0].sha)

# for each check run requested for this commit, get the "status"
# field and assign to the "check_status" variable
for check_status in $(curl -s \
    -H "accept: application/vnd.github.antiope-preview+json" \
    -H "authorization: token ${PAT}" \
    https://api.github.com/repos/${OWNER}/${REPO}/commits/${latest_commit}/check-runs \
    | jq -r '.check_runs[] | "\(.status)"'); do

    # if "check_status" is queued launch an action runner
    if [ "${check_status}" == "queued" ]; then
        echo "Found check run request with status ${check_status}, launching job ..."
        # note: we removed the {TOKEN} replacement here
        cat job.yaml \
            | sed -r "s/\{NAME\}/$(uuidgen)/g; s/\{OWNER\}/${OWNER}/; s/\{REPO\}/${REPO}/" \
            | kubectl apply -f -
    else
        echo "Found check run request with status '${check_status}', nothing to do ..."
    fi
done
The code up to this point can be found here.
Running our script as a kubernetes cronjob
To run our orchestrator script as a kubernetes cronjob we first need to create a docker image:
FROM ubuntu

RUN useradd -m actions \
    && apt-get update \
    && apt-get install -y \
        curl \
        jq \
        uuid-runtime

RUN curl -LO https://storage.googleapis.com/kubernetes-release/release/v1.17.0/bin/linux/amd64/kubectl \
    && mv kubectl /usr/local/bin \
    && chmod +x /usr/local/bin/kubectl

WORKDIR /home/actions
USER actions
COPY orc.sh .

ENTRYPOINT ["./orc.sh"]
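For reference, building and pushing it looks roughly like this, assuming you are logged in to the registry:
# build the orchestrator image and push it so the cluster can pull it
docker build -t wayofthepie/actions-orchestrator .
docker push wayofthepie/actions-orchestrator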
I built this as wayofthepie/actions-orchestrator and pushed it to the public docker registry. Next, let's create a CronJob spec:
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: actions-orchestrator
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: actions-orchestrator
            image: wayofthepie/actions-orchestrator
            args: ["$(GITHUB_TOKEN)", "{OWNER}", "{REPO}"]
            env:
            - name: GITHUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: github-token
                  key: token
          restartPolicy: Never
This will run our orchestrator every minute. To create the cronjob, replace the owner and repo with your own:
$ cat cron.yaml \
    | sed -r "s/\{OWNER\}/wayofthepie/; s/\{REPO\}/gh-app-test/" \
    | kubectl apply -f -
cronjob.batch/actions-orchestrator created
$ kubectl get cronjob
NAME SCHEDULE SUSPEND ACTIVE LAST SCHEDULE AGE
actions-orchestrator */1 * * * * False 0 <none> 7s
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
actions-orchestrator-1581193620-4j8jz 0/1 ContainerCreating 0 1s
$ kubectl logs actions-orchestrator-1581193620-4j8jz
Found check run request with status queued, launching job ...
Error from server (Forbidden): error when retrieving current configuration of:
...
from server for: "STDIN": jobs.batch "316b08ed-89e7-4321-a521-897c7a40fa50" is forbidden: User "system:serviceaccount:default:default" cannot get resource "jobs" in API group "batch" in the namespace "default"
An error! Let's delete the cronjob so it doesn't keep running, kubectl delete cronjob actions-orchestrator, and investigate.
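A quick way to confirm what the pod's service account is allowed to do is kubectl auth can-i, for example:
# ask the api server whether the default service account can create jobs
kubectl auth can-i create jobs --as=system:serviceaccount:default:default
# this should print "no" until we grant it access below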
Assigning the correct permissions
It seems the default service account we get in the pod does not have access to the jobs resource. To fix this we need to create a ClusterRole and ClusterRoleBinding:
# bind the jobs-manager role to the default service account in the default namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: default
subjects:
- kind: ServiceAccount
  name: default
  namespace: default
roleRef:
  kind: ClusterRole
  name: jobs-manager
  apiGroup: rbac.authorization.k8s.io
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  namespace: default
  name: jobs-manager
rules:
- apiGroups: ["batch", "extensions"]
  resources: ["jobs"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
And create it:
$ kubectl apply -f cluster-role.yaml
clusterrole.rbac.authorization.k8s.io/default created
Re-create our cronjob:
$ cat cron.yaml | sed -r "s/\{OWNER\}/wayofthepie/; s/\{REPO\}/gh-app-test/" | kubectl apply -f -
cronjob.batch/actions-orchestrator created
# it should run every minute
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
actions-orchestrator-1581196020-tmbll 1/1 Running 0 4s
Great! Now if we commit to the test repo it should create a new job for the requested check runs.
Fixing a bug in our logic
We still only check the last commit for check requests, meaning we can miss requests and leave check runs for some commits sitting idle. This is a bug. The real fix would require either a lot of API calls or using webhooks, but for now we can look at the last 5 minutes of commits rather than just the last commit. If we run the script every minute there is a much smaller chance of missing check runs.
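For reference, the commits endpoint accepts a since parameter in ISO 8601 format, so a one-off version of that query looks roughly like this (using GNU date; the script below builds the timestamp slightly differently):
# list commits pushed to the repo in the last five minutes
since=$(date -u --date='5 minutes ago' +"%Y-%m-%dT%H:%M:%SZ")
curl -s -H "authorization: token ${PAT}" \
    "https://api.github.com/repos/${OWNER}/${REPO}/commits?since=${since}" \
    | jq -r '.[].sha'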
The updates to orc.sh are:
#!/usr/bin/env bash

PAT=$1
OWNER=$2
REPO=$3

# make sure we have values for all our arguments
[ -z ${PAT} ] || [ -z ${OWNER} ] || [ -z $REPO ] && {
    echo "Incorrect usage, example: ./orc.sh personal-access-token owner some-repo"
    exit 1
}

# get the date in the format the github api wants
function five_minutes_ago {
    echo $(date --iso-8601=seconds --date='5 minutes ago' | awk -F'+' '{print $1}')
}

echo "Getting commits from the last 5 minutes ..."
commits=$(curl -s -H "accept: application/vnd.github.antiope-preview+json" \
    -H "authorization: token ${PAT}" \
    https://api.github.com/repos/${OWNER}/${REPO}/commits?since="$(five_minutes_ago)Z" \
    | jq -r .[].sha)

for commit in ${commits[@]}; do
    echo "Checking ${commit} for check requests ..."

    # for each check run requested for this commit, get the "status"
    # field and assign to the "check_status" variable
    for check_status in $(curl -s \
        -H "accept: application/vnd.github.antiope-preview+json" \
        -H "authorization: token ${PAT}" \
        https://api.github.com/repos/${OWNER}/${REPO}/commits/${commit}/check-runs \
        | jq -r '.check_runs[] | "\(.status)"'); do

        # if "check_status" is queued launch an action runner
        if [ "${check_status}" == "queued" ]; then
            echo "Found check run request with status ${check_status}, launching job ..."
            cat job.yaml \
                | sed -r "s/\{NAME\}/$(uuidgen)/g; s/\{OWNER\}/${OWNER}/; s/\{REPO\}/${REPO}/" \
                | kubectl apply -f -
        else
            echo "Found check run request with status '${check_status}', nothing to do ..."
        fi
    done
done
Rebuild the actions orchestrator image, push it, and it should all work! The image up to this point is tagged as wayofthepie/actions-orchestrator:8-2-2020.
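Roughly, that means rebuilding and pushing with the new tag, then pointing the image field in cron.yaml at it and re-applying, something like:
# rebuild and push the orchestrator image with an explicit tag
docker build -t wayofthepie/actions-orchestrator:8-2-2020 .
docker push wayofthepie/actions-orchestrator:8-2-2020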
The code up to this point can be found here.
Conclusion
We now have a way of orchestrating actions runners on kubernetes. There are still a few issues however:
- There is no error recovery and the error messages are pretty bad. For example, if for some reason the cronjob does not run for 5+ minutes, we may miss commits and check runs.
- It would be much better to use webhooks here.
- We currently only support watching a single repository.
- Things are getting complicated with bash!
I will tackle some of these in the next post.
Top comments (6)
Nice series of blog posts about orchestrating action runners. I took a slightly different approach: I built an orchestrator with a serverless architecture that acts on GitHub events to create action runners. Feel free to have a look at my post dev.to/npalm/scaling-github-action...
That looks great! I wanted to end this series with something similar, but decided not to go down that path. The main reason is that if you look in the actions-runner code, after registration it just polls a url like pipelines.actions.githubuserconten.... I think it uses part of the Azure DevOps infrastructure.
If github documented this and made it a public API, we could build agentless ephemeral runners much more easily. The check run API is good, but this would be even better. I am thinking of building something around this as a POC, if I get some time.
Still, your solution looks great too; there are quite a few good solutions out there now. I can't use lambda currently, but I may look at something similar, based off that, once we need to use self-hosted actions. Thanks for sharing!
Great post, I loved the narrative nature of it, well done.
I think since it was written, there have been two updates from GitHub that will make this even smoother:
1) Using the --ephemeral parameter when invoking the ./config.sh command, which will clean the runner and de-register it automatically: docs.github.com/en/actions/hosting...
2) Using the workflow_job event when a job is queued: docs.github.com/en/developers/webh...
Awesome posts.
I'm looking forward to part 6: "provisioning ephemeral clusters for builds"
Maybe you are interested in github.com/evryfs/github-actions-r... ?
Cool! This looks like what I hoped the outcome of these posts would be; I ran out of time the last few weeks.
Will play about with this soon, thanks!