Table of Contents
- GitHub checks API
- Using kubernetes jobs to schedule actions runners
- Running our script as a kubernetes cronjob
- Fixing a bug in our logic
- Conclusion
In the last post we got a simple actions runner orchestrator running with bash and cron. We also noted a few issues with that version. In this post we will fix up the following issues:
- Instead of launching a runner per commit, we will instead launch a runner per check request.
- Instead of running local docker containers, we will run kubernetes jobs.
- Instead of just running locally with cron, we will create a kubernetes CronJob.
Let's do it!
GitHub checks API
When a check run is requested, the GitHub checks API will reflect this. For example:
$ curl -s \
    -H "accept: application/vnd.github.antiope-preview+json" \
    -H "authorization: token ${PAT}" \
    https://api.github.com/repos/wayofthepie/gh-app-test/commits/e13119b0/check-runs
{
    "total_count": 1,
    "check_runs": [
        {
            "id": 433544203,
            "node_id": "MDg6Q2hlY2tSdW40MzM1NDQyMDM=",
            "head_sha": "e13119b07d81e4c587882b2f7c9d7a730810f709",
            "external_id": "ca395085-040a-526b-2ce8-bdc85f692774",
            "url": "https://api.github.com/repos/wayofthepie/gh-app-test/check-runs/433544203",
            "html_url": "https://github.com/wayofthepie/gh-app-test/runs/433544203",
            "details_url": "https://help.github.com/en/actions",
            "status": "queued",
            "conclusion": null,
            "started_at": "2020-02-08T14:40:27Z",
            "completed_at": null,
            ...
This will return the status of all the check runs for the given commit (in the URL above, e13119b0 is the short ref for the commit). As you can see, the status of the first check run is queued, which in this case means it is waiting for a runner to execute on.
Using this information will also allow our script to be completely stateless. In the previous post it had to keep track of the last commit in a file; with the checks API we no longer need this.
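As a quick sanity check, we can filter for the runs that are still waiting. Here is a rough one-off sketch, where PAT, OWNER, REPO and COMMIT are placeholders for your own values:
# list the ids of check runs that are still queued for a commit
curl -s \
    -H "accept: application/vnd.github.antiope-preview+json" \
    -H "authorization: token ${PAT}" \
    "https://api.github.com/repos/${OWNER}/${REPO}/commits/${COMMIT}/check-runs" \
    | jq -r '.check_runs[] | select(.status == "queued") | .id'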
Making use of this new information
Now we can change the logic in our orchestration script as follows:
- Get the latest commit.
- Get all requested check runs for that commit.
- For each check run that is still queued, launch an actions runner.
Here is the updated script:
#!/usr/bin/env bash

PAT=$1
OWNER=$2
REPO=$3

# make sure we have values for all our arguments
[ -z ${PAT} ] || [ -z ${OWNER} ] || [ -z $REPO ] && {
    echo "Incorrect usage, example: ./orc.sh personal-access-token owner some-repo"
    exit 1
}

# get the latest commit
latest_commit=$(curl -s -H "authorization: token ${PAT}" \
    https://api.github.com/repos/${OWNER}/${REPO}/commits |\
    jq -r .[0].sha)

# for each check run requested for this commit, get the "status"
# field and assign to the "check_status" variable
for check_status in $(curl -s \
    -H "accept: application/vnd.github.antiope-preview+json" \
    -H "authorization: token ${PAT}" \
    https://api.github.com/repos/${OWNER}/${REPO}/commits/${latest_commit}/check-runs \
    | jq -r '.check_runs[] | "\(.status)"'); do

    # if "check_status" is queued launch an action runner
    if [ "${check_status}" == "queued" ]; then
        echo "Launching actions runner ..."
        docker run -d --rm actions-image \
            ${OWNER} \
            ${REPO} \
            ${PAT} \
            $(uuidgen)
    fi
done
The code up to this point can be found here.
Add a new commit to the repository you have been running the actions against and run ./orc.sh ${PAT} ${OWNER} ${REPO}; it should start a container and run the build.
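If you want to watch it happen, something like the following should work, assuming the image is tagged actions-image as in the docker run command above:
# the runner container should show up here while the build is in flight
docker ps --filter ancestor=actions-image
# follow its logs using the container id printed by docker ps
docker logs -f <container-id>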
Using kubernetes jobs to schedule actions runners
If we move to kubernetes instead of our local docker daemon we can scale out much more easily. So let's update the orchestrator to launch the actions runners as kubernetes jobs instead of plain docker containers.
First we need a cluster. There are many ways to create one; I use Google Cloud, so that is where I'll create mine. Note that this has a cost, see https://kubernetes.io/docs/setup/ for examples of local cluster setups.
To create a cluster on google cloud:
$ gcloud container clusters create actions-spawner --zone europe-west2-c --num-nodes 1
WARNING: Currently VPC-native is not ...
WARNING: Newly created clusters ...
...
Creating cluster actions-spawner in europe-west2-c... Cluster is being health-checked (master is healthy)...done.
Created [https://container.googleapis.com/v1/projects/monthly-hacking/zones/europe-west2-c/clusters/actions-spawner].
To inspect the contents of your cluster, go to: https://console.cloud.google.com/kubernetes/workload_/gcloud/europe-west2-c/actions-spawner?project=monthly-hacking
kubeconfig entry generated for actions-spawner.
NAME LOCATION MASTER_VERSION MASTER_IP MACHINE_TYPE NODE_VERSION NUM_NODES STATUS
actions-spawner europe-west2-c 1.13.11-gke.23 35.189.78.16 n1-standard-1 1.13.11-gke.23 1 RUNNING
This will create a cluster with a single node, which is all we need for testing. Make sure you have kubectl installed; auth should be set up for kubectl automatically. To test, let's list the nodes:
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
gke-actions-spawner-default-pool-f1380f72-3wm5 Ready <none> 4m47s v1.13.11-gke.23
Looks good! Now, let's update our orchestration script:
#!/usr/bin/env bash

PAT=$1
OWNER=$2
REPO=$3

# make sure we have values for all our arguments
[ -z ${PAT} ] || [ -z ${OWNER} ] || [ -z $REPO ] && {
    echo "Incorrect usage, example: ./orc.sh personal-access-token owner some-repo"
    exit 1
}

# get the latest commit
latest_commit=$(curl -s -H "authorization: token ${PAT}" \
    https://api.github.com/repos/${OWNER}/${REPO}/commits |\
    jq -r .[0].sha)

# for each check run requested for this commit, get the "status"
# field and assign to the "check_status" variable
for check_status in $(curl -s \
    -H "accept: application/vnd.github.antiope-preview+json" \
    -H "authorization: token ${PAT}" \
    https://api.github.com/repos/${OWNER}/${REPO}/commits/${latest_commit}/check-runs \
    | jq -r '.check_runs[] | "\(.status)"'); do

    # if "check_status" is queued launch an action runner
    if [ "${check_status}" == "queued" ]; then
        echo "Found check run request with status ${check_status}, launching job ..."
        cat job.yaml \
            | sed -r "s/\{NAME\}/$(uuidgen)/g; s/\{OWNER\}/${OWNER}/; s/\{REPO\}/${REPO}/; s/\{TOKEN\}/${PAT}/" \
            | kubectl apply -f -
    else
        echo "Found check run request with status '${check_status}', nothing to do ..."
    fi
done
And create job.yaml, the specification for our kubernetes job:
apiVersion: batch/v1
kind: Job
metadata:
  name: {NAME}
spec:
  template:
    spec:
      containers:
      - name: {NAME}
        image: wayofthepie/actions-image
        args: ["{OWNER}", "{REPO}", "{TOKEN}"]
      restartPolicy: Never
  backoffLimit: 4
⚠️ WARNING ⚠️
The token here should be stored as a kubernetes secret. Using the token directly as I have above is not good practice. I will fix this later in this post.
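Before wiring anything else up, it can be useful to preview what the templating produces. A rough sketch, using the same placeholders as job.yaml:
# render job.yaml with throwaway values and eyeball the result before applying it
cat job.yaml \
    | sed -r "s/\{NAME\}/$(uuidgen)/g; s/\{OWNER\}/${OWNER}/; s/\{REPO\}/${REPO}/; s/\{TOKEN\}/${PAT}/"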
Let's test it:
$ ./orc.sh ${PAT} ${OWNER} ${REPO}
Found check run request with status 'completed', nothing to do ...
Great, it works when there are no runs requested. Commit to the repo you are testing against and run again:
$ ./orc.sh ${PAT} ${OWNER} ${REPO}
Found check run request with status queued, launching job ...
job.batch/990a0d3d-bb98-419e-abc5-ca4fa48ca328 created
Looks like it worked. Let's see what's running:
$ kubectl get jobs
NAME COMPLETIONS DURATION AGE
990a0d3d-bb98-419e-abc5-ca4fa48ca328 0/1 5s 5s
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
990a0d3d-bb98-419e-abc5-ca4fa48ca328-wxn5c 1/1 Running 0 8s
$ kubectl logs 990a0d3d-bb98-419e-abc5-ca4fa48ca328-wxn5c -f
Unrecognized command-line input arguments: 'name'. For usage refer to: .\config.cmd --help or ./config.sh --help
--------------------------------------------------------------------------------
| ____ _ _ _ _ _ _ _ _ |
| / ___(_) |_| | | |_ _| |__ / \ ___| |_(_) ___ _ __ ___ |
| | | _| | __| |_| | | | | '_ \ / _ \ / __| __| |/ _ \| '_ \/ __| |
| | |_| | | |_| _ | |_| | |_) | / ___ \ (__| |_| | (_) | | | \__ \ |
| \____|_|\__|_| |_|\__,_|_.__/ /_/ \_\___|\__|_|\___/|_| |_|___/ |
| |
| Self-hosted runner registration |
| |
--------------------------------------------------------------------------------
# Authentication
√ Connected to GitHub
# Runner Registration
Enter the name of runner: [press Enter for 990a0d3d-bb98-419e-abc5-ca4fa48ca328-wxn5c]
√ Runner successfully added
√ Runner connection is good
# Runner settings
√ Settings Saved.
√ Connected to GitHub
2020-02-08 16:35:49Z: Listening for Jobs
2020-02-08 16:35:53Z: Running job: build
Great! It kicked off a build. However, notice the warning at the start:
Unrecognized command-line input arguments: 'name'. For usage refer to: .\config.cmd --help or ./config.sh --help
Something is wrong... A quick look through orc.sh and job.yaml highlights the issue: we are missing the fourth argument to the wayofthepie/actions-image image! This sets the name of the actions runner. Let's fix it up:
apiVersion: batch/v1
kind: Job
metadata:
  name: {NAME}
spec:
  template:
    spec:
      containers:
      - name: {NAME}
        image: wayofthepie/actions-image
        # here we add the name argument
        args: ["{OWNER}", "{REPO}", "{TOKEN}", "{NAME}"]
      restartPolicy: Never
  backoffLimit: 4
Let's commit to the test repo and run again:
$ ./orc.sh ${PAT} ${OWNER} ${REPO}
Found check run request with status queued, launching job ...
job.batch/7abcb7a1-b1bb-4641-88af-fc4562e29bb7 created
$ kubectl get jobs
NAME COMPLETIONS DURATION AGE
7abcb7a1-b1bb-4641-88af-fc4562e29bb7 0/1 4s 4s
990a0d3d-bb98-419e-abc5-ca4fa48ca328 1/1 46s 12m
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
7abcb7a1-b1bb-4641-88af-fc4562e29bb7-q5jdd 1/1 Running 0 13s
990a0d3d-bb98-419e-abc5-ca4fa48ca328-wxn5c 0/1 Completed 0 12m
$ kubectl logs 7abcb7a1-b1bb-4641-88af-fc4562e29bb7-q5jdd
--------------------------------------------------------------------------------
| ____ _ _ _ _ _ _ _ _ |
| / ___(_) |_| | | |_ _| |__ / \ ___| |_(_) ___ _ __ ___ |
| | | _| | __| |_| | | | | '_ \ / _ \ / __| __| |/ _ \| '_ \/ __| |
| | |_| | | |_| _ | |_| | |_) | / ___ \ (__| |_| | (_) | | | \__ \ |
| \____|_|\__|_| |_|\__,_|_.__/ /_/ \_\___|\__|_|\___/|_| |_|___/ |
| |
| Self-hosted runner registration |
| |
--------------------------------------------------------------------------------
# Authentication
√ Connected to GitHub
# Runner Registration
√ Runner successfully added
√ Runner connection is good
# Runner settings
√ Settings Saved.
√ Connected to GitHub
2020-02-08 16:48:30Z: Listening for Jobs
2020-02-08 16:48:34Z: Running job: build
2020-02-08 16:48:52Z: Job build completed with result: Succeeded
# Runner removal
√ Runner removed successfully
√ Removed .credentials
√ Removed .runner
Great! No warnings, all working as expected. The code up to this point can be seen here.
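One bit of housekeeping: completed jobs and their pods stick around until deleted, so you may want to clean up the earlier job at some point, for example:
# remove a finished job (this also removes its pod)
kubectl delete job 990a0d3d-bb98-419e-abc5-ca4fa48ca328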
Storing our token properly
Right now our personal access token gets stored in the definition of our job! If we retrieve our job we can see it:
$ kubectl get jobs 990a0d3d-bb98-419e-abc5-ca4fa48ca328 -o json
{
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {
        ...
    },
    "spec": {
        ...
        "template": {
            ...
            "spec": {
                "containers": [
                    {
                        "args": [
                            "wayofthepie",
                            "gh-app-test",
                            "THE TOKEN IS IN HERE!!!"
                        ],
                        "image": "wayofthepie/actions-image",
                        "imagePullPolicy": "Always",
                        "name": "990a0d3d-bb98-419e-abc5-ca4fa48ca328",
                        ...
                    }
                ],
                ...
}
This is not good! We should be storing this as a kubernetes secret. Let's do that. First, create a secret.yaml defining our secret:
apiVersion: v1
kind: Secret
metadata:
  name: github-token
type: Opaque
stringData:
  token: {TOKEN}
To create the secret:
$ cat secret.yaml \
    | sed -r "s/\{TOKEN\}/${YOUR_PAT}/" \
    | kubectl apply -f -
secret/github-token created
$ kubectl get secrets
NAME TYPE DATA AGE
...
github-token Opaque 1 30s
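To sanity check what was stored, we can read the secret back and decode it, since kubernetes stores the value base64 encoded:
# print the decoded token - be careful where you run this!
kubectl get secret github-token -o jsonpath='{.data.token}' | base64 --decode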
With that created, let's update our job spec in job.yaml:
apiVersion: batch/v1
kind: Job
metadata:
  name: {NAME}
spec:
  template:
    spec:
      containers:
      - name: {NAME}
        image: wayofthepie/actions-image
        args: ["{OWNER}", "{REPO}", "$(GITHUB_TOKEN)", "{NAME}"]
        env:
        - name: GITHUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: github-token
              key: token
      restartPolicy: Never
  backoffLimit: 4
See these docs for more on reading secret data into env vars.
Also note the syntax used to reference GITHUB_TOKEN: it uses $() and not ${}, see here.
Re-run and everything should work:
$ ./orc.sh ${PAT} ${OWNER} ${REPO}
Found check run request with status queued, launching job ...
job.batch/eb1e314d-594b-4253-ae8a-74c797a2cd76 created
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
eb1e314d-594b-4253-ae8a-74c797a2cd76-x47lr 0/1 ContainerCreating 0 3s
$ kubectl logs -f eb1e314d-594b-4253-ae8a-74c797a2cd76-x47lr
...
2020-02-08 17:41:08Z: Listening for Jobs
2020-02-08 17:41:12Z: Running job: build
2020-02-08 17:41:30Z: Job build completed with result: Succeeded
# Runner removal
√ Runner removed successfully
√ Removed .credentials
√ Removed .runner
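This time the token is no longer baked into the job definition; the stored args should contain the $(GITHUB_TOKEN) placeholder rather than the raw token. A quick way to double check, using the job name from the run above:
# the args in the stored spec should show $(GITHUB_TOKEN), not the token itself
kubectl get job eb1e314d-594b-4253-ae8a-74c797a2cd76 \
    -o jsonpath='{.spec.template.spec.containers[0].args}'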
Great! We should also clean up orc.sh:
#!/usr/bin/env bash

PAT=$1
OWNER=$2
REPO=$3

# make sure we have values for all our arguments
[ -z ${PAT} ] || [ -z ${OWNER} ] || [ -z $REPO ] && {
    echo "Incorrect usage, example: ./orc.sh personal-access-token owner some-repo"
    exit 1
}

# get the latest commit
latest_commit=$(curl -s -H "authorization: token ${PAT}" \
    https://api.github.com/repos/${OWNER}/${REPO}/commits |\
    jq -r .[0].sha)

# for each check run requested for this commit, get the "status"
# field and assign to the "check_status" variable
for check_status in $(curl -s \
    -H "accept: application/vnd.github.antiope-preview+json" \
    -H "authorization: token ${PAT}" \
    https://api.github.com/repos/${OWNER}/${REPO}/commits/${latest_commit}/check-runs \
    | jq -r '.check_runs[] | "\(.status)"'); do

    # if "check_status" is queued launch an action runner
    if [ "${check_status}" == "queued" ]; then
        echo "Found check run request with status ${check_status}, launching job ..."
        # note: we removed the {TOKEN} replacement here
        cat job.yaml \
            | sed -r "s/\{NAME\}/$(uuidgen)/g; s/\{OWNER\}/${OWNER}/; s/\{REPO\}/${REPO}/" \
            | kubectl apply -f -
    else
        echo "Found check run request with status '${check_status}', nothing to do ..."
    fi
done
The code up to this point can be found here.
Running our script as a kubernetes cronjob
To run our orchestrator script as a kubernetes cronjob we first need to create a docker image:
FROM ubuntu

RUN useradd -m actions \
    && apt-get update \
    && apt-get install -y \
        curl \
        jq \
        uuid-runtime

RUN curl -LO https://storage.googleapis.com/kubernetes-release/release/v1.17.0/bin/linux/amd64/kubectl \
    && mv kubectl /usr/local/bin \
    && chmod +x /usr/local/bin/kubectl

WORKDIR /home/actions
USER actions
COPY orc.sh .

ENTRYPOINT ["./orc.sh"]
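For reference, building and pushing it looks roughly like this, assuming you are logged in to the registry:
# build the orchestrator image and push it so the cluster can pull it
docker build -t wayofthepie/actions-orchestrator .
docker push wayofthepie/actions-orchestrator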
I built this as wayofthepie/actions-orchestrator and pushed it to the public docker registry. Next, let's create a CronJob spec:
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: actions-orchestrator
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: actions-orchestrator
            image: wayofthepie/actions-orchestrator
            args: ["$(GITHUB_TOKEN)", "{OWNER}", "{REPO}"]
            env:
            - name: GITHUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: github-token
                  key: token
          restartPolicy: Never
This will run our orchestrator every minute. To create the cronjob, replace the owner and repo with your own:
$ cat cron.yaml \
    | sed -r "s/\{OWNER\}/wayofthepie/; s/\{REPO\}/gh-app-test/" \
    | kubectl apply -f -
cronjob.batch/actions-orchestrator created
$ kubectl get cronjob
NAME SCHEDULE SUSPEND ACTIVE LAST SCHEDULE AGE
actions-orchestrator */1 * * * * False 0 <none> 7s
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
actions-orchestrator-1581193620-4j8jz 0/1 ContainerCreating 0 1s
$ kubectl logs actions-orchestrator-1581193620-4j8jz
Found check run request with status queued, launching job ...
Error from server (Forbidden): error when retrieving current configuration of:
...
from server for: "STDIN": jobs.batch "316b08ed-89e7-4321-a521-897c7a40fa50" is forbidden: User "system:serviceaccount:default:default" cannot get resource "jobs" in API group "batch" in the namespace "default"
An error! Let's delete the cronjob so it doesn't keep running, kubectl delete cronjob actions-orchestrator, and investigate.
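A quick way to confirm what the pod's service account is allowed to do is kubectl auth can-i, for example:
# ask the api server whether the default service account can create jobs
kubectl auth can-i create jobs --as=system:serviceaccount:default:default
# this should print "no" until we grant it access below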
Assigning the correct permissions
It seems the default service account we get in the pod does not have access to the jobs resource. To fix this we need to create a ClusterRole and ClusterRoleBinding:
# bind the jobs-manager role to the default service account in the default namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: default
subjects:
- kind: ServiceAccount
  name: default
  namespace: default
roleRef:
  kind: ClusterRole
  name: jobs-manager
  apiGroup: rbac.authorization.k8s.io
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  namespace: default
  name: jobs-manager
rules:
- apiGroups: ["batch", "extensions"]
  resources: ["jobs"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
And create it:
$ kubectl apply -f cluster-role.yaml
clusterrole.rbac.authorization.k8s.io/default created
Re-create our cronjob:
$ cat cron.yaml | sed -r "s/\{OWNER\}/wayofthepie/; s/\{REPO\}/gh-app-test/" | kubectl apply -f -
cronjob.batch/actions-orchestrator created
# it should run every minute
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
actions-orchestrator-1581196020-tmbll 1/1 Running 0 4s
Great! Now if we commit to the test repo it should create a new job for the requested check runs.
Fixing a bug in our logic
We still only check the last commit for check requests, meaning we can miss requests and leave check runs for some commits sitting idle. This is a bug. The real fix would require either a lot of API calls or using webhooks, but for now we can look at the last 5 minutes of commits rather than just the last commit. If we run the script every minute there is a much smaller chance of missing check runs.
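For reference, the commits endpoint accepts a since parameter in ISO 8601 format, so a one-off version of that query looks roughly like this (using GNU date; the script below builds the timestamp slightly differently):
# list commits pushed to the repo in the last five minutes
since=$(date -u --date='5 minutes ago' +"%Y-%m-%dT%H:%M:%SZ")
curl -s -H "authorization: token ${PAT}" \
    "https://api.github.com/repos/${OWNER}/${REPO}/commits?since=${since}" \
    | jq -r '.[].sha'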
The updates to orc.sh are:
#!/usr/bin/env bash

PAT=$1
OWNER=$2
REPO=$3

# make sure we have values for all our arguments
[ -z ${PAT} ] || [ -z ${OWNER} ] || [ -z $REPO ] && {
    echo "Incorrect usage, example: ./orc.sh personal-access-token owner some-repo"
    exit 1
}

# get the date in the format the github api wants
function five_minutes_ago {
    echo $(date --iso-8601=seconds --date='5 minutes ago' | awk -F'+' '{print $1}')
}

echo "Getting commits from the last 5 minutes ..."
commits=$(curl -s -H "accept: application/vnd.github.antiope-preview+json" \
    -H "authorization: token ${PAT}" \
    https://api.github.com/repos/${OWNER}/${REPO}/commits?since="$(five_minutes_ago)Z" \
    | jq -r .[].sha)

for commit in ${commits[@]}; do
    echo "Checking ${commit} for check requests ..."

    # for each check run requested for this commit, get the "status"
    # field and assign to the "check_status" variable
    for check_status in $(curl -s \
        -H "accept: application/vnd.github.antiope-preview+json" \
        -H "authorization: token ${PAT}" \
        https://api.github.com/repos/${OWNER}/${REPO}/commits/${commit}/check-runs \
        | jq -r '.check_runs[] | "\(.status)"'); do

        # if "check_status" is queued launch an action runner
        if [ "${check_status}" == "queued" ]; then
            echo "Found check run request with status ${check_status}, launching job ..."
            cat job.yaml \
                | sed -r "s/\{NAME\}/$(uuidgen)/g; s/\{OWNER\}/${OWNER}/; s/\{REPO\}/${REPO}/" \
                | kubectl apply -f -
        else
            echo "Found check run request with status '${check_status}', nothing to do ..."
        fi
    done
done
Rebuild the actions orchestrator image, push it, and it should all work! The image up to this point is tagged as wayofthepie/actions-orchestrator:8-2-2020.
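Roughly, that means rebuilding and pushing with the new tag, then pointing the image field in cron.yaml at it and re-applying, something like:
# rebuild and push the orchestrator image with an explicit tag
docker build -t wayofthepie/actions-orchestrator:8-2-2020 .
docker push wayofthepie/actions-orchestrator:8-2-2020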
The code up to this point can be found here.
Conclusion
We now have a way of orchestrating actions runners on kubernetes. There are still a few issues however:
- There is no error recovery and the error messages are pretty bad. For example, if for some reason the cronjob does not run for 5+ minutes, we may miss commits and check runs.
- It would be much better to use webhooks here.
- We currently only support watching a single repository.
- Things are getting complicated with bash!
I will tackle some of these in the next post.
Top comments (6)
Nice series of blog posts about orchestrating action runners. I took a slightly different approach: I built an orchestrator with a serverless architecture that acts on GitHub events to create action runners. Feel free to have a look at my post dev.to/npalm/scaling-github-action...
That looks great! I wanted to end this series with something similar, but decided not to go down that path. The main reason is that if you look in the actions-runner code, after registration it just polls a url like pipelines.actions.githubuserconten.... I think it uses part of the Azure DevOps infrastructure.
If github documented this and made it a public API, we could build agentless ephemeral runners much more easily. The check run API is good, but this would be even better. I am thinking of building something around this as a POC, if I get some time.
Still, your solution looks great too; there are quite a few good solutions out there now. I can't use lambda currently, but I may look at something similar, based off that, once we need to use self-hosted actions. Thanks for sharing!
Great post, I loved the narrative nature of it, well done.
I think since it was written, there have been two updates from GitHub that will make this even smoother:
1) Using the --ephemeral parameter when invoking the ./config.sh command, which will clean the runner and de-register it automatically: docs.github.com/en/actions/hosting...
2) Using the workflow_job event when a job is queued: docs.github.com/en/developers/webh...
Awesome posts.
I'm looking forward to part 6: "provisioning ephemeral clusters for builds"
Maybe you are interested in github.com/evryfs/github-actions-r... ?
Cool! This looks like what I hoped the outcome of these posts would be; I ran out of time the last few weeks.
Will play about with this soon, thanks!