Today I'm going to show you how to build a GitHub stats scraper using a scheduled-execution AWS Lambda function.
The Python code inside the Lambda will leverage PyGitHub to communicate with the V3 GitHub APIs. The code builds a simple Markdown-flavored report, then publishes it to a Gist. The Gist will display the Markdown in rendered form, so you can share a link to it.
The tables we'll build are simple ranked top-ten tables; you'll see an example later in this post.
Motivation
AWS Amplify is almost entirely open-source. We use GitHub to host our code. Our engineers tend to work across our various repositories, and it can be tough to keep track of contributions. Likewise, we want to recognize external parties who have donated their time and mindshare.
Some questions I'd like this tool to help answer:
- Who are the top authors of pull requests, across our GitHub org?
- Who are the top reviewers of pull requests, across our GitHub org?
There are, of course, lots of ways to contribute to a project. Tracking activity on GitHub pull requests is just one view of the world. This tool just makes data available. It doesn't prescribe a way for you to analyze it.
System Overview
The system works like this:
- Amazon CloudWatch fires a periodic event
- It causes an AWS Lambda function to execute
- The Lambda function queries GitHub's API for contribution data
- The Lambda builds a Markdown report
- The Lambda publishes the Markdown to a GitHub Gist
Setting up the Basics
Creating a Lambda function and setting it up for periodic execution are fairly straightforward tasks in the AWS Console. After you follow the linked guides from the AWS documentation, you should be left with a dummy Python function that just prints out whatever arguments are passed to it.
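If you'd rather script that wiring than click through the console, a rough boto3 sketch of the scheduling pieces might look like the following. (The function and rule names here are placeholders; the console guides above accomplish the same thing.)
import boto3

events = boto3.client('events')
lambda_client = boto3.client('lambda')

FUNCTION_NAME = 'top_contribs_periodic'  # placeholder: your Lambda's name
RULE_NAME = 'top-contribs-daily'         # placeholder: any rule name you like

# 1. Create a CloudWatch Events rule that fires once a day.
rule_arn = events.put_rule(
    Name=RULE_NAME,
    ScheduleExpression='rate(1 day)',
)['RuleArn']

# 2. Grant the rule permission to invoke the Lambda.
function_arn = lambda_client.get_function(
    FunctionName=FUNCTION_NAME,
)['Configuration']['FunctionArn']
lambda_client.add_permission(
    FunctionName=FUNCTION_NAME,
    StatementId='allow-cloudwatch-events',
    Action='lambda:InvokeFunction',
    Principal='events.amazonaws.com',
    SourceArn=rule_arn,
)

# 3. Point the rule at the Lambda.
events.put_targets(
    Rule=RULE_NAME,
    Targets=[{'Id': '1', 'Arn': function_arn}],
)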
In order to start communicating with GitHub, we'll have to obtain an access token. The PyGitHub documentation shows a concise example of initializing a client with a GitHub token:
from github import Github

g = Github('access_token')
To obtain one, head over to github.com/settings/tokens, and click "Generate new token" in the top-right. Give it some name you'll remember. Leave all of the scopes unchecked, but do check the gist box.
Lastly, click "Generate token" at the bottom of the page, and copy-paste the output. Keep the token someplace safe where only you'll have access to it.
Setting up Dependencies in our Lambda
For simple computations, you can often update a Lambda function directly from the AWS Lambda Console. However, PyGitHub doesn't exist in the Lambda execution environment. So we'll need to make it available to our Lambda, somehow. The strategy we'll use is to bundle all of our dependencies with the Lambda, so that they're locally available when the code runs.
AWS Lambda provides documentation to "Deploy Python Lambda functions with .zip file archives." But, I'll take a more targeted approach, here. The tasks needed to add the PyGitHub dependency to our Lambda are described below.
First, use the AWS CLI to find the lambda function you created:
aws lambda list-functions | \
jq -r '.Functions[].FunctionName'
This outputs:
ExampleLambda
demo_function
top_contribs_periodic
In this list, I recognize top_contribs_periodic as the Lambda that I created earlier.
Now we're going to ask AWS Lambda to give us a URL from which we can download the function code:
aws lambda get-function \
--function-name top_contribs_periodic | \
jq -r '.Code.Location'
The URL will be very long. It's a pre-signed S3 URL for a zip file in an AWS-managed S3 bucket. It should start off like:
https://prod-04-2014-tasks.s3.us-east-1.amazonaws.com/snapsh....
Now we can copy-paste that URL, and use curl to download the zip file from S3:
curl '<paste-url-from-above>' --output original_lambda.zip
While we're working on this, we're going to be manipulating zip files that contain a bunch of content at the root level of the archive. Whenever I'm working with a zip that doesn't have a single top-level directory in it, I like to know what I'm dealing with, first:
zip --show-files original_lambda.zip
This outputs:
Archive contains:
lambda_function.py
Total 1 entries (2217 bytes)
Let's make a working directory, and unzip the code there:
mkdir workingdir
cd workingdir
unzip ../original_lambda.zip
Next, we're going to use the virtualenv utility to install our dependencies into this directory. (This step follows the Lambda documentation, "with a virtual environment," pretty closely.)
Let's make sure you have virtualenv installed:
pip3 install virtualenv
Now, create a virtual environment:
virtualenv myvenv
Activate the environment:
source myvenv/bin/activate
Install PyGitHub:
pip install pygithub
Almost done. Deactivate the environment:
deactivate
Start creating a new zip file. We'll upload this zip back to AWS Lambda, in a second.
cd myvenv/lib/python3.8/site-packages
zip -r ../../../../updated_lambda.zip .
The last thing we need to do before re-uploading the code is to package our lambda_function.py back into the .zip. So that we can validate our dependency installation, let's include a very simple use of PyGitHub. Update lambda_function.py so that it looks like this:
import os
from github import Github

def lambda_handler(event, context):
    g = Github(os.environ['GITHUB_TOKEN'])
    # Accessing an attribute forces a real API call (PyGitHub objects are lazy).
    print(g.get_user().login)
The code above will look for your GitHub token in the environment, instantiate an API client, then call a simple API and, hopefully, not crash.
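If you want to sanity-check the handler before uploading it, you can run it locally with the same environment variable set. (This assumes you're inside the virtualenv where PyGitHub is installed; the test file name is arbitrary.)
# local_test.py -- a quick local smoke test; run with GITHUB_TOKEN exported
from lambda_function import lambda_handler

lambda_handler({}, None)  # raises if the token is missing or invalid
print('PyGitHub client works')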
Save this file into the zip:
cd -
zip -g updated_lambda.zip lambda_function.py
And finally, upload it:
aws lambda update-function-code \
--function-name top_contribs_periodic \
--zip-file fileb://updated_lambda.zip
Back in the AWS Console, add a new environment variable, GITHUB_TOKEN, to your Lambda function. Set its value to the token you captured from github.com/settings/tokens.
By now, you should be able to click the "Test" button in the top-right of your AWS Lambda console, to try and invoke the function. The simple Lambda should succeed. But of course, it doesn't really do much yet.
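If you prefer the command line to the Test button, a minimal boto3 sketch for invoking the function looks like this (assuming your AWS credentials are configured locally):
import boto3

client = boto3.client('lambda')
response = client.invoke(FunctionName='top_contribs_periodic')

print('Status:', response['StatusCode'])            # 200 on a successful invoke
print('Payload:', response['Payload'].read().decode())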
Building out the Core Logic
Well, we succeeded in making a simple call to GitHub from AWS Lambda! Great. Now, let's implement some logic.
Let's update the Lambda to do something a little more complex.
import os
from datetime import datetime, timedelta

from github import Github

def list_recent_contributions(g):
    org = g.get_organization('aws-amplify')
    month_ago = datetime.now() - timedelta(days=30)
    for repo in org.get_repos():
        # Newest PRs come first, so we can stop paging once we pass the window.
        for pull in repo.get_pulls(sort='created', direction='desc', state='all'):
            if pull.created_at < month_ago:
                break
            login = pull.user.login
            title = pull.title
            print('Recent pull from {0}: {1}'.format(login, title))

def lambda_handler(event, context):
    g = Github(os.environ['GITHUB_TOKEN'])
    list_recent_contributions(g)
The code above finds all public repos in an organization ('aws-amplify'), and then queries them for pull requests. The search is in descending order by creation date, meaning that recently created PRs show up first. We stop iterating over the pages of results as soon as we see a pull that falls outside the window we're interested in. (In this case, we're only looking at the last 30 days.)
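To turn that listing into rankings, a natural extension (not part of the snippet above, but a sketch of where we're headed) is to tally authors into a dict as we iterate:
from collections import Counter
from datetime import datetime, timedelta

def count_recent_authors(g):
    # Tally PR authors across the org's public repos, over the last 30 days.
    org = g.get_organization('aws-amplify')
    month_ago = datetime.now() - timedelta(days=30)
    authors = Counter()
    for repo in org.get_repos():
        for pull in repo.get_pulls(sort='created', direction='desc', state='all'):
            if pull.created_at < month_ago:
                break
            authors[pull.user.login] += 1
    return authors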
You can kind of see the possibilities at this point. You can bounce between the PyGitHub Library Reference and the GitHub REST API reference and start tuning the logic to meet your needs.
Building Markdown
There are two main challenges in producing the Markdown report.
- Decoupling document text from Python language syntax
- Producing data tables, row by row
To achieve the first goal, I'm using a multi-line string. To strip the leading indentation from the Python source, I call textwrap.dedent on it. The document text has a couple of placeholders that I populate with str.format(...) at the end. My document-creation function looks roughly like this:
return textwrap.dedent("""
    # Top Contributors, Last 30 Days

    This is a list of the top contributors to the aws-amplify Github org's public repos.
    Contributions from the last 30 days are considered.
    This document is updated by a cron job every day.

    Contributors are from AWS and from the community.
    Contribution counts are a running sum of a user's contributions across all repos.

    ### Top 10 Authors

    {0}

    ### Top 10 Reviewers (by total comments)

    {1}

    -----------------------

    Last updated {2}.
""").format(authors_table, reviewers_table, str(datetime.datetime.today()))
But what about the authors_table and reviewers_table? How are those obtained? To build the Markdown tables, I use some more string templates. I have a utility function which can transform a dict into a Markdown table:
def top_ten_table(key_label, val_label, entries):
    row_template = '|{0}|{1}|{2}|\n'
    # Header row, then the Markdown separator row.
    table = row_template.format('Rank', key_label, val_label)
    table += row_template.format('--------', '--------', '--------')
    # Sort by value, descending, and keep only the top ten.
    item_list = sorted(entries.items(), key=lambda x: x[1], reverse=True)
    for index, (key, value) in enumerate(item_list[:10]):
        table += row_template.format(str(1 + index), key, str(value))
    return table
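For example, feeding it a hypothetical tally (these names and counts are made up) produces a well-formed Markdown table:
authors = {'alice': 42, 'bob': 17, 'carol': 29}
print(top_ten_table('Author', 'PRs created', authors))
This prints:
|Rank|Author|PRs created|
|--------|--------|--------|
|1|alice|42|
|2|carol|29|
|3|bob|17|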
Publishing a Gist
The last big piece of this project is to publish the outputs somewhere.
There are a number of cool possibilities here. If you're using some static hosting solution like Amplify Hosting or GitHub Pages, your published Markdown could get stylized with whatever rules your existing web app applies. You could pretty much publish the Markdown anywhere.
I had considered publishing the output to my GitHub Pages repository, and then letting my Jekyll site render it with its stylesheet.
But, in the interest of keeping things simple, let's just put it into a Gist for right now.
The Gist APIs are actually a weak spot in the PyGitHub implementation, IMO. What I'd like to do is just have a user.update_gist(...) function available. Unfortunately, there isn't anything exactly like that. Instead, we have two related capabilities available:
- Search all of my Gists, get a handle to one, and call .edit(...) on it; or
- Create a new Gist.
Well, fine. Let's try to encapsulate those two things into a single utility method:
def write_gist(gh, filename, description, content):
    # Requires "import github" at module level, for github.InputFileContent.
    files = {filename: github.InputFileContent(content=content)}
    user = gh.get_user()
    print("Looking for matching Gists....")
    for gist in user.get_gists():
        if gist.description == description:
            print("Found a matching Gist. We'll update it.")
            gist.edit(files=files, description=description)
            return
    print("No existing Gist, creating a new one....")
    user.create_gist(public=True, files=files, description=description)
This function will look for an existing Gist with a given description. If it finds one, then it updates it. If it doesn't find one, it goes on to create a new Gist with that description. 🥳
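Wiring it into the handler then takes just one call. (build_report here is a stand-in for the Markdown-building logic from the previous section; the description string is an arbitrary choice, but note that it's what the function keys on to find the same Gist again on later runs.)
def lambda_handler(event, context):
    g = Github(os.environ['GITHUB_TOKEN'])
    report = build_report(g)  # stand-in for the report-building code above
    write_gist(g, 'contrib.md', 'Top contributors, aws-amplify org', report)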
When you run this function, it will save the provided content to a Gist. Since the script currently hard-codes 'contrib.md' as the filename, the resulting URL you end up with will look like this:
https://gist.github.com/yourusername/uniqueid#file-contrib-md
Wrapping Up
That's most of the nuts and bolts of it. The complete script I'm using is available here, on GitHub.
The output of that script is visible here.
Let me know what you think! What features should I add to it next? Happy hacking, and happy holidays.