Pre-push Hooks

What is a Hook?

Git has a few different extension points where you can execute arbitrary scripts. Two of the most well-known are the pre-commit and pre-push hooks. As discussed below, these hooks allow you to perform additional validation before completing a commit or push.

Why Not Just CI?

As a repository grows, you onboard more linting tools, test suites, and other validations. The goal is to catch regressions and bad practices as early as possible. CI/CD (e.g., GitHub Actions) provides a means to ensure that these validations are run on every PR and every commit before being merged, as an invariant.

However, CI/CD jobs can grow to be quite slow - hours, in some cases. For certain tasks that are quick to run, you'd probably prefer to run them on your local development machine before raising a PR.

For example, checking if the locally-changed code is well-formatted is one task that can be done fairly quickly, and makes sense to do before raising a PR.

On the other hand, running a full test suite may be best handled by the CI as a final check. Most changes aren't at risk of breaking unrelated tests, and the developer would like to work on other things while CI works in parallel.

Pre-commit vs. Pre-push

As with most things in engineering, there is a trade-off between running your validations in a pre-commit hook vs. a pre-push hook.

The pre-commit hook will run much more often, and can potentially enable you to keep each commit in a valid state. However, if you need to invoke the project's build system to complete validation, it may become a nuisance and ultimately slow you down. Worse, if you're making surgical, atomic commits, a pre-commit hook that modifies files might even mess up your carefully staged tree.

Pre-push hooks don't have most of those problems and are the final opportunity to check code before it goes up for PR. One downside: you may end up with an additional "fix formatting"-type commit in your log this way. But, if you have GitHub configured to squash commits on merge, it doesn't matter much. Since pre-push hooks are run much less frequently, it also makes sense to perform some slightly longer validations in this hook.

In a pre-push hook, you have already established a lineage of commits that will be sent for PR. So, it is also a safer time to apply edits to the codebase, if you want to. For example, if the code in the push ref has bad formatting, we might like to format the code instead of just simply checking it. In a pre-commit hook, that might be disruptive to in-flight coding.

How to build a pre-push system

Various tools provide their own integrations with Git hooks. For example, the Gradle ktlint Plugin offers a few tasks to register check/format tasks as a Git hook. But, as soon as you want to run multiple tasks in your hook, you'll probably wish you had better machinery.

So, let's state a few of our up-front goals. We want to:

Create an extensible script that will allow us to run multiple validation scripts during push;
Provide clear and actionable feedback to the user about what is happening with the push: did it succeed, and if not, why not;
Provide a simple way for developers on our team to install the pre-push scripts and get pertinent updates as improvements are added.

To solve the first problem, let's create a directory in our project called ./scripts/pre-push.d (inspired by the Linux/UNIX practice of splitting out config stubs into .d directories). We'll plan to have a ./scripts/pre-push script that calls out to the stubs.
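
As a rough sketch, that layout could be scaffolded as follows. The helper name and the stub file names (10-ktfmt, 20-sort-dependencies) are hypothetical placeholders; the numeric prefixes are just one way to make the alphabetical run order explicit.

```shell
#!/bin/bash
# Sketch: scaffold the scripts/ layout described above under a given
# project root. The stub names are hypothetical placeholders.
scaffold_pre_push_layout() {
    local project_root="$1"
    mkdir -p "${project_root}/scripts/pre-push.d"
    # The fan-out script and the installer live next to the stub directory.
    touch "${project_root}/scripts/pre-push" \
          "${project_root}/scripts/install-pre-push"
    # Numeric prefixes make the alphabetical execution order obvious.
    touch "${project_root}/scripts/pre-push.d/10-ktfmt" \
          "${project_root}/scripts/pre-push.d/20-sort-dependencies"
    chmod +x "${project_root}/scripts/pre-push" \
             "${project_root}/scripts/install-pre-push"
}
```

Calling scaffold_pre_push_layout with the path to a checkout creates empty files only; the real contents are the scripts discussed in the rest of this post.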

To solve the second problem, we'll be careful to fail the pre-push as soon as one of the stubs fails and to be thoughtful about the error message we emit if it does fail.
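
One way to get that behavior, as a minimal sketch: run each check in order, stop at the first non-zero exit status, and say which check failed. The run_checks helper and its message wording are assumptions for illustration, not part of the scripts shown later.

```shell
#!/bin/bash
# Sketch: run each check in order and stop at the first failure,
# naming the check that failed. run_checks is a hypothetical helper.
run_checks() {
    local check status
    for check in "$@"; do
        echo "pre-push: running ${check} ..."
        bash "$check"
        status=$?
        if [ "$status" -ne 0 ]; then
            # Send the failure report to stderr and propagate the
            # failing check's exit status to the caller.
            echo "pre-push: ${check} failed (exit ${status}); push aborted." >&2
            echo "pre-push: fix the problem above, commit, and push again." >&2
            return "$status"
        fi
    done
    echo "pre-push: all checks passed."
}
```

Because the function returns as soon as one check fails, later checks never run, so the developer sees exactly one problem to act on.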

We can solve the third problem by creating a simple ./scripts/install-pre-push script that our teammates can run. Recall that there's no simple way to have hooks automatically added to .git/hooks when a user clones a repository. So, you'll need to ask them to run this script as part of developer onboarding.

Our installation script will do a little cleanup in .git/hooks to ensure no old version of hooks will cause a conflict. Then, it will simply create a symbolic link from our ./scripts/pre-push script into .git/hooks. This way the developer will always run the latest pre-push script contents without having to re-run ./scripts/install-pre-push each time there's an update to its logic.
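
To illustrate why the symbolic link helps, here is a small sketch that uses a throwaway directory instead of a real repository. The directory names and the "version" messages are made up for the demonstration.

```shell
#!/bin/bash
# Sketch: a symlinked hook always runs the current contents of its
# target. Uses a throwaway directory rather than a real repository.
demo_dir="$(mktemp -d)"
mkdir -p "${demo_dir}/scripts" "${demo_dir}/hooks"

printf '#!/bin/bash\necho "version 1"\n' > "${demo_dir}/scripts/pre-push"
ln -s "${demo_dir}/scripts/pre-push" "${demo_dir}/hooks/pre-push"

first_run="$(bash "${demo_dir}/hooks/pre-push")"

# Simulate pulling an update to the tracked script; the link itself
# is not touched.
printf '#!/bin/bash\necho "version 2"\n' > "${demo_dir}/scripts/pre-push"
second_run="$(bash "${demo_dir}/hooks/pre-push")"

echo "before update: ${first_run}"
echo "after update:  ${second_run}"
```

Note that a link to an absolute path, as used here, stops resolving if the checkout is moved to another directory; re-running the installer fixes that.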

The installation script

The install-pre-push script is shown below. Note that it mostly does cleanup, and then creates a symbolic link.



#!/bin/bash

# Print a message and kill the script.
die() {
    echo "$@" 1>&2
    exit 1
}

# Finds the top of the repo.
find_git_repo_top() {
    local current_dir=$(pwd)

    # Loop until reaching the root directory "/"
    while [ "$current_dir" != "/" ]; do
        # Check if ".git" directory exists
        if [ -d "$current_dir/.git" ]; then
            echo "$current_dir"
            return
        fi

        # Move up one directory
        current_dir=$(dirname "$current_dir")
    done

    # If ".git" directory is not found, report on stderr so the message
    # is not captured as this function's output.
    echo "Git repository not found." >&2
    return 1
}

# Ask the user a yes/no question and await their response. Return 0 if
# they say yes (in some format).
await_yes_no() {
    read -r answer
    case "$answer" in
        [yY]|[yY][eE][sS])
            echo 0
            ;;
        *)
            echo 1
            ;;
    esac
}

# Delete whatever hooks may be active in the .git/hooks directory. This
# may include things like the old pre-commit hook we had been using
delete_existing_hooks_with_confirmation() {
    project_root="$1"
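    hooks=$(find "${project_root}/.git/hooks/" -mindepth 1 ! -name "*.sample")
    if [ -z "$hooks" ]; then
        # Nothing to clean up, so there is nothing to confirm.
        return
    fi
    echo "Found hook files: $hooks"
    # Anything other than an explicit yes aborts, so "no" is the default.
    echo "OK to delete? [y/N]"
    if [ "$(await_yes_no)" -ne 0 ]; then
        die "OK; aborting."
    fi
    rm -f -r $hooks
}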

# Install the pre-push script and any hooks found under the ./scripts
# directory.
install_pre_push_hooks() {
    project_root="$1"
    echo "Installing scripts into .git/hooks ..."
    mkdir -p "${project_root}/.git/hooks"
    ln -s "${project_root}/scripts/pre-push" "${project_root}/.git/hooks/pre-push"
}

# Installs pre-push scripts after locating the repository root and
# cleaning up any old git hook scripts.
main() {
    project_root="$(find_git_repo_top)" || die "Could not locate the top of the Git repository."
    delete_existing_hooks_with_confirmation "$project_root"
    install_pre_push_hooks "$project_root"
}

main

The pre-push script

The pre-push script is also pretty simple. Its primary job is to find the stub files in ./scripts/pre-push.d, and then pass each of them the same input that Git passed to it.

For a pre-push hook, Git writes four pieces of information to the hook's standard input, once per line for each reference being pushed:

localname
localhash
remotename
remotehash

Most of the time, you'll run something like:



git push origin my-branch

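In this case, the four fields arrive only once. remotename is the ref that will be updated on the remote (here, refs/heads/my-branch), and remotehash is the commit that ref currently points to there, or all zeros if the branch doesn't exist on the remote yet.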

It is possible, however, to run more complex push commands, such as:



git push origin my-branch-1:refs/heads/target-1 my-branch-2:refs/heads/target-2

This is useful for us to understand the meaning of the four fields, but as we'll see later, we won't need all of them. Most of our practical hooks will only consider the localhash.
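
To make the input format concrete, here is a minimal sketch that reads those lines and classifies each one. The helper names are hypothetical. It relies on two documented Git behaviors: remotehash is all zeros when the remote ref doesn't exist yet, and localhash is all zeros (with localname set to "(delete)") when a ref is being deleted.

```shell
#!/bin/bash
# Sketch: classify each line Git writes to a pre-push hook's standard
# input. describe_push_line and describe_push are hypothetical helpers.
ZERO_HASH_PATTERN='^0+$'

describe_push_line() {
    local localname="$1" localhash="$2" remotename="$3" remotehash="$4"
    if [[ "$localhash" =~ $ZERO_HASH_PATTERN ]]; then
        # Nothing local to validate when a remote ref is being deleted.
        echo "delete ${remotename}"
    elif [[ "$remotehash" =~ $ZERO_HASH_PATTERN ]]; then
        echo "create ${remotename} from ${localname}"
    else
        echo "update ${remotename} from ${localname}"
    fi
}

# Reads one "localname localhash remotename remotehash" line per ref.
describe_push() {
    local localname localhash remotename remotehash
    while read -r localname localhash remotename remotehash; do
        describe_push_line "$localname" "$localhash" "$remotename" "$remotehash"
    done
}
```

A stub that only cares about localhash may still want the "delete" case, since an all-zero hash is not a commit that can be diffed.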



#!/bin/bash
#
# Don't write actual logic in this file. This file just fans out to
# scripts/pre-push.d/<your_file>, so that we can add a number of
# checks into one place. This file aims to honor the original pre-push
# contract and fan it out to stub files.

# Finds the top of the repo.
find_git_repo_top() {
    local current_dir=$(pwd)

    # Loop until reaching the root directory "/"
    while [ "$current_dir" != "/" ]; do
        # Check if ".git" directory exists
        if [ -d "$current_dir/.git" ]; then
            echo "$current_dir"
            return
        fi

        # Move up one directory
        current_dir=$(dirname "$current_dir")
    done

    # If ".git" directory is not found, report on stderr so the message
    # is not captured as this function's output.
    echo "Git repository not found." >&2
    return 1
}

project_root=$(find_git_repo_top) || exit 1
hooks=$(find "${project_root}/scripts/pre-push.d" -type f ! -name "*.sw*" | sort)

# pre-push.d receives four pieces of information for each source-target
# push that may be in play. For example, origin->refs/heads/origin, and
# my_kool->refs/heads/my_kool_branch would cause two iterations of this
# while loop.
while read -r localname localhash remotename remotehash; do
    # For each set of push data, iterate over the hooks in alphabetical
    # order. Pass the hook data in using the same contract.
    for hook in $hooks; do
        echo "$localname $localhash $remotename $remotehash" | bash "$hook"
        RESULT="$?"
        if [ $RESULT != 0 ]; then
            exit "$RESULT"
        fi
    done
done

exit 0

A pre-push.d script

The more of these you see and write, the more you realize there are lots of edge cases. So let's consider our ultimate goal here.

The branch we want to keep in good condition is origin's main. In trunk-based development, we don't care so much about other branches. So as I mentioned above, let's not worry about the remotehash and remotename; let's always compare our local changes against origin/main.

Another concern is performance. Ideally, we'd like to use work avoidance to skip these validations entirely if there have been no relevant changes in the ref being pushed. And even when there are changes between origin/main and the ref to be pushed, we'd like to limit the validations to only those changes, if possible. So, we need some infrastructure for identifying the changed files.
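
As a minimal sketch of that filtering step, the helper below keeps only the paths that match an extension pattern, reading one path per line (the shape that git diff --name-only produces). The helper names are hypothetical; note the escaped dot, so that only a literal ".kt" or ".kts" suffix matches.

```shell
#!/bin/bash
# Sketch: keep only the paths matching an extension pattern, reading
# one path per line. filter_by_pattern is a hypothetical helper.
filter_by_pattern() {
    local pattern="$1" path
    while IFS= read -r path; do
        if [[ "$path" =~ $pattern ]]; then
            printf '%s\n' "$path"
        fi
    done
}

# Only .kt and .kts files are relevant to a Kotlin formatter.
filter_kotlin() {
    filter_by_pattern '\.kts?$'
}
```

In a hook this could be fed from a diff, for example git diff --name-only "$from_hash" "$to_hash" | filter_kotlin. An empty result means the check can be skipped entirely.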

Lastly, we'd like to keep the user informed about what checks are being run and print out information about what state we've left the tree in. If we can auto-correct some of the issues, that would be a nice thing to do, too.

The script below runs ktfmt over a Kotlin codebase and fails the push if any of the files are not well-formatted. It relies on a custom Gradle plugin we built around ktfmt, which accepts a --run-over change set. When the push fails, the script leaves the formatting fixes in the working tree for the developer to review, commit, and push again.

The script below can be trivially adapted to any number of other tools, and you can include a few such scripts in your ./scripts/pre-push.d. For example, my current codebase has one for ktfmt, and a very similar one for Square's Gradle Dependencies Sorter.



#!/bin/bash
#
# This script looks for Kotlin files that have been changed locally and
# ensures that they are correctly formatted according to ktfmt. This
# script makes an effort to only run on the smallest possible set of
# changed files as opposed to running broadly over the codebase, to
# improve pre-push performance.
#
# The script takes the following steps:
#
#  1. Figure out the name of the remote for your repo on
#     GitHub. If there is no remote (via `git remote`) add one called
#     "your_company"
#  2. Fetch the latest `main` ref from that remote.
#  3. For any commits that are about to be pushed (git push can push
#     multiple references at once), do the following steps, 4-7:
#  4. Find an ancestor commit that is before both origin/main and the
#     commit you're trying to push.
#  5. Compute a list of .kt or .kts files that have changed between that
#     ancestor and the commit being pushed
#  6. Run ./gradlew ktfmtFormatPartial on only those files using --run-over=<list>.
#  7. If there are no formatting changes applied, proceed to push.
#     Otherwise, print a descriptive error message noting that files
#     have been formatted and that the user will need to commit manually
#     and push.

# Print the name of the configured remote (usually "origin")
expected_remote() {
    git remote -v | awk '/git@github.com:Your-org\/your-repo.git \(fetch\)/ { print $1 }'
}

# Prints the name of the remote that points at the company repo. If no
# such remote is found locally, we'll add one called "your_company"
# (a conservative name so it doesn't clash with whatever else you have
# going on).
ensure_remote_installed() {
    remote_name="$(expected_remote)"
    if [ -z "$remote_name" ]; then
        git remote add your_company "git@github.com:Your-org/your-repo.git"
        echo "your_company"
    else
        echo "$remote_name"
    fi
}

# Fetches, but does not apply, the remote references from GitHub's copy
# of the project.
fetch_remote_refs() {
    remote_name="$1"
    git fetch "$remote_name" main &>/dev/null
}

# Computes a list of the names of the files that have changed between
# two commit hashes.
compute_changed_files() {
    from_hash="$1"
    to_hash="$2"
    git diff --name-only "$from_hash" "$to_hash"
}

# Determines if a file is a kotlin file.
is_kotlin_file() {
    file_path="$1"
    # Escape the dot so that only a literal ".kt" or ".kts" suffix matches.
    if [[ "$file_path" =~ \.kts?$ ]]; then
        echo 0
    else
        echo 1
    fi
}

# Computes which .kts? files have changed.
compute_changed_kotlins() {
    from_hash="$1"
    to_hash="$2"
    changed_kotlins=""
    for changed_file in $(compute_changed_files "$from_hash" "$to_hash"); do
        if [ "$(is_kotlin_file "$changed_file")" -eq 0 ]; then
            changed_kotlins="$changed_file $changed_kotlins"
        fi
    done
    echo "$changed_kotlins"
}

# This finds a common ancestor between origin/main and whatever commit
# is being pushed. The idea here is that the local tree may not be
# rebased onto origin/main itself, so we need to look backwards in
# origin/main to find a commit that *is* in our history. This is the
# developer's current marker for origin/main.
compute_from_hash() {
    remote_name="$1"
    local_hash="$2"
    git merge-base "$remote_name/main" "$local_hash"
}

# Given two lists of strings, compute the intersection of the two lists,
# e.g., if A="foo bar", and B="foo", the intersection is "foo".
compute_intersection() {
    intersection=""
    foo="$1"
    bar="$2"

    for f in $foo; do
        for b in $bar; do
            if [[ "$b" == "$f" ]]; then
                intersection="$intersection $b"
            fi
        done
    done

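    # Print one entry per line so that sort can actually de-duplicate
    # them, then join the result back into a single line.
    printf '%s\n' $intersection | sort -u | xargs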
}

# Checks the result of the ktfmt task to see if it formatted any files.
# If files were formatted, fail the hook and emit an error. If none were
# formatted, continue to exit the hook successfully.
fail_if_any_formatted() {
    changed_since_main="$1"
    changed_after_fmt="$(git diff --name-only | xargs)"
    changed_in_both=$(compute_intersection "$changed_since_main" "$changed_after_fmt")
    if [ -n "$changed_in_both" ]; then
cat <<- EOF
The following files were not formatted correctly and have been fixed locally. Please commit them and try your push again.
$changed_in_both
EOF
        exit 1
    fi
}

# Runs ktfmt over any changed .kts? files.
validate_ktfmt() {
    remote_name="$1"
    to_hash="$2"
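    from_hash="$(compute_from_hash "$remote_name" "$to_hash")"

    changed_kotlins="$(compute_changed_kotlins "$from_hash" "$to_hash" | xargs)"
    if [ -z "$changed_kotlins" ]; then
        # Nothing to check for this ref. Return (rather than exit) so that
        # any remaining refs in the same push are still validated.
        return 0
    fi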

    echo "Running ktfmtFormatPartial over changed Kotlin files: $changed_kotlins ..."
    ./gradlew ktfmtFormatPartial --run-over="$changed_kotlins" &>/dev/null
    fail_if_any_formatted "$changed_kotlins"
}

remote_name=$(ensure_remote_installed)
fetch_remote_refs "$remote_name"

while read localname localhash remotename remotehash; do
    validate_ktfmt "$remote_name" "$localhash"
done

exit 0

In my opinion, the most interesting part of this script is the git merge-base call, which computes a common ancestor between origin/main and the ref being pushed. This is useful since you may not be rebased onto the remote's main when the hook runs. If origin/main is always correct, then the diff between that common ancestor and your ref captures the entirety of your responsibility.
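
Here is a small sketch of that effect in a throwaway repository. Everything in it (the repository, the branch names, the file names) is made up for illustration, and it uses a local main branch in place of origin/main to stay self-contained.

```shell
#!/bin/bash
# Sketch: diffing against the merge base lists only the files changed
# on the feature branch, even after main has moved on. Everything here
# (repository, branches, file names) is made up for illustration.
repo="$(mktemp -d)"
g() { git -C "$repo" -c user.name=demo -c user.email=demo@example.com "$@"; }

g init -q
g symbolic-ref HEAD refs/heads/main

echo "base" > "$repo/shared.txt"
g add shared.txt
g commit -q -m "base"

# Work on a feature branch ...
g checkout -q -b feature
echo "mine" > "$repo/Feature.kt"
g add Feature.kt
g commit -q -m "feature work"

# ... while main moves on without us.
g checkout -q main
echo "theirs" > "$repo/Other.kt"
g add Other.kt
g commit -q -m "someone else's work"

from_hash="$(g merge-base main feature)"
changed_vs_base="$(g diff --name-only "$from_hash" feature)"
changed_vs_main="$(g diff --name-only main feature | sort | xargs)"

echo "against merge base: ${changed_vs_base}"
echo "against main tip:   ${changed_vs_main}"
```

Diffing against the tip of main also reports Other.kt, a file the feature branch never touched, while diffing against the merge base reports only Feature.kt.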

Conclusion

Well, there you have it. After writing a few of these, I put my "best of" playlist up for all to see. Each of these scripts can be found in this Gist. Let me know if you have a better way of doing this, or any ideas for improvements!