Dmitry Yakimenko

Posted on Feb 15, 2019

Git-Fu: merge multiple repos with linear history

#git #madness #reinventingthewheel

The other day I invented a new headache for myself: I wanted to merge a few libraries I've built over the years into one repo and refactor them together. It looks simple at first glance. Copy all the files into subfolders, add and commit:

$ git add .
$ git commit -m 'All together now!'

Done!

No, not really. This would eliminate the history. And I really wanted to keep it. I often go back to see what changes have been made to a file, and I use blame to see when and why certain code was modified. I looked around to see how to get it done quickly. I found a whole bunch of blog posts describing similar undertakings. I also found code snippets, shell and Python scripts and even Java programs to do just that.

After trying some of those (not the Java though, no thank you) I realized they don't do exactly what I want. Some of the authors tried to keep the original commit hashes. Some authors wanted to have the original file paths. I wanted to be able to track changes from the beginning of the history.

Most of the approaches I found were not compatible with my goals. Usually people try to import a repo into a branch (usually by adding a remote from another repo), move all the files into a subfolder in one commit and then merge that branch into master. This creates one big commit where all the files get moved to a subfolder. And then another giant merge commit, where all the changes from one branch get copied to another branch. When you view such a repo on GitHub, you'd see that file history gets broken (blame still works though).

I also discovered a built-in command git subtree and it turns out it suffers from the same problems as all the third party tools I tried before that. So no go! Need to reinvent the wheel here and come up with my own solution.

So, basically, I needed to find a way to merge all the repos without creating any merge commits. And I needed to move the original files into subfolders. Two Git tools come to mind: cherry-pick and filter-branch.

A sidenote. I used to use Mercurial at work a few years back and it was great! The user experience on Mercurial is amazing. I kinda wish it hadn't died a slow death and let the inferior product take over the dev scene. Mercurial was as intuitive and user friendly as Git is powerful and versatile. Git is like a really twisted Lego set: you can build whatever you want out of it.

So here's the plan:

put each repo in its own branch
rewrite the history to move all the files in each commit into a subfolder
rewrite the history to prepend the repo name to the commit message
cherry pick all the commits from all the branches in chronological order into master
delete branches
garbage collect to shrink the repo

Easy peasy. Feel kinda masochistic today, so let's do it in Bash.

First, as always, we need to make sure the whole script fails when any command fails. This usually saves a lot of time when something goes wrong. And it usually does.

#!/bin/bash
set -euo pipefail
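
A minimal sketch of what each of the three options changes. Every case runs in a child bash so the failure doesn't end the demo itself; the variable names are made up for illustration.

```shell
#!/bin/bash
# Each case runs in a child bash so that its failure does not end this script.

# -e: the script stops at the first failing command.
e_status=0
bash -c 'set -e; false; echo "not reached"' || e_status=$?

# -u: expanding an unset variable is an error instead of an empty string.
u_status=0
bash -c 'set -u; echo "$undefined_variable"' 2> /dev/null || u_status=$?

# pipefail: a pipeline fails if any stage fails, not only the last one.
plain_status=0
bash -c 'false | true' || plain_status=$?
pipefail_status=0
bash -c 'set -o pipefail; false | true' || pipefail_status=$?

echo "-e: $e_status, -u: $u_status, no pipefail: $plain_status, pipefail: $pipefail_status"
```

Only the pipeline without pipefail reports success, because by default a pipeline returns the status of its last command.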

List of repos I'd like to join:

repos="1password bitwarden dashlane lastpass opvault passwordbox roboform stickypassword truekey zoho-vault"

Make sure we start from scratch:

rm -rf joined
git init joined
cd joined

Now, here's a tough one:

for repo in $repos; do
    git remote add $repo $REPO_DIR/$repo
    git fetch $repo
    git checkout -b $repo $repo/master
    echo -n "$repo: " > prefix
    git filter-branch \
        -f \
        --tree-filter "mkdir -p .original/$repo && rsync -a --remove-source-files ./ .original/$repo/" \
        --msg-filter "cat $(pwd)/prefix -" \
        --env-filter 'GIT_COMMITTER_DATE="$GIT_AUTHOR_DATE"' \
        $repo
    rm prefix
done

Let's go over it piece by piece. First, I import a repo into its own branch. The repo named lastpass ends up in a branch named lastpass. Nothing difficult so far.

git remote add $repo $REPO_DIR/$repo
git fetch $repo
git checkout -b $repo $repo/master

In the next step I rewrite the history for each repo to move files into a subfolder for each commit. For example, all the files coming from the repo lastpass would end up in the .original/lastpass/ folder. And it would be changed for all the commits in the history, like all the development was done inside this folder and not at the root.

git filter-branch \
    -f \
    --tree-filter "mkdir -p .original/$repo && rsync -a --remove-source-files ./ .original/$repo/" \
    --msg-filter "cat $(pwd)/prefix -" \
    --env-filter 'GIT_COMMITTER_DATE="$GIT_AUTHOR_DATE"' \
    $repo

The filter-branch command is a multifunctional beast. It's possible to change the repo beyond any recognition with all the switches it provides. It's possible to FUBAR it too. Actually it's super easy. That's why Git keeps a backup of the original refs under refs/original/ (refs/original/refs/heads/lastpass for the lastpass branch). To force the backup to be overwritten if it's already there I use the -f switch.

When the --tree-filter switch is used, every commit is checked out to a temporary directory and I can rewrite the commit using regular file operations. So for every commit, I create a directory .original/$repo and move all the files into it using rsync.
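
To see the effect in isolation, here is a small sketch on a throwaway repo. The repo, the file names and the demo folder are made up, and find with mv stands in for the rsync call, so this is a variation on the script and not the script itself. FILTER_BRANCH_SQUELCH_WARNING only silences the warning newer Git versions print before filter-branch starts.

```shell
#!/bin/bash
set -euo pipefail

# Throwaway repo with two commits at the root (all names here are made up).
work=$(mktemp -d)
cd "$work"
git init -q .
export GIT_AUTHOR_NAME='Demo' GIT_AUTHOR_EMAIL='demo@example.com'
export GIT_COMMITTER_NAME='Demo' GIT_COMMITTER_EMAIL='demo@example.com'

echo 'one' > a.txt
git add a.txt
git commit -q -m 'add a'
mkdir src
echo 'two' > src/b.txt
git add src/b.txt
git commit -q -m 'add b'

# Rewrite every commit so that its files live under .original/demo/.
# find + mv replaces the rsync call from the post for this demo.
export FILTER_BRANCH_SQUELCH_WARNING=1
git filter-branch -f \
    --tree-filter 'mkdir -p .original/demo && find . -mindepth 1 -maxdepth 1 ! -name .original -exec mv {} .original/demo/ \;' \
    HEAD > /dev/null

# Files in the latest commit and in the one before it.
files=$(git ls-files)
first_commit_files=$(git ls-tree -r --name-only HEAD~1)
echo "$files"
```

Both commits now show their files under .original/demo/, as if they had been created there from the start.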

The --msg-filter switch allows me to rewrite the commit message. I'd like to add the repo name to the message, so that all the commits coming from the lastpass repo would look like lastpass: original commit message. For each commit the filter receives the commit message on stdin, and whatever it writes to stdout becomes the new commit message. In this case I use cat to join the prefix file and stdin (-). I couldn't figure out why a simple echo -n wouldn't work, so I had to save the message prefix into a file.
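
The cat trick can be tried without Git at all. In this sketch echo plays the role of Git piping the old message into the filter, and the commit message is made up. The sed line is an alternative that avoids the prefix file; it is a suggestion, not what the script uses.

```shell
#!/bin/bash
set -euo pipefail
work=$(mktemp -d)
cd "$work"

# The prefix file, written without a trailing newline
# (printf is the portable spelling of echo -n).
printf '%s' 'lastpass: ' > prefix

# cat prints the file first and then stdin (-), so the prefix
# lands right in front of the original message.
with_cat=$(echo 'Fix login parsing' | cat prefix -)

# Alternative without the file: insert the prefix at the start of line 1.
with_sed=$(echo 'Fix login parsing' | sed '1s/^/lastpass: /')

echo "$with_cat"
echo "$with_sed"
```

Both lines print lastpass: Fix login parsing.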

And the last bit with --env-filter is needed to reset the commit date to the original date (author date in Git terminology). If I didn't do it, Git would change the timestamp to the current time. I didn't want that.
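
A sketch of the difference between the two dates, again on a throwaway repo with a made-up commit. The commit gets an old author date, Git stamps the committer date with the current time, and the same --env-filter as in the script brings them back in line.

```shell
#!/bin/bash
set -euo pipefail

work=$(mktemp -d)
cd "$work"
git init -q .
export GIT_AUTHOR_NAME='Demo' GIT_AUTHOR_EMAIL='demo@example.com'
export GIT_COMMITTER_NAME='Demo' GIT_COMMITTER_EMAIL='demo@example.com'

# An old author date; the committer date defaults to "now".
GIT_AUTHOR_DATE='1425211200 +0000' git commit -q --allow-empty -m 'old work'
before=$(git log -1 --pretty='author=%at committer=%ct')

# Copy the author date over the committer date for every rewritten commit.
export FILTER_BRANCH_SQUELCH_WARNING=1
git filter-branch -f \
    --env-filter 'GIT_COMMITTER_DATE="$GIT_AUTHOR_DATE"' \
    HEAD > /dev/null
after=$(git log -1 --pretty='author=%at committer=%ct')

echo "before: $before"
echo "after:  $after"
```

%at and %ct print the author and committer time as Unix timestamps, which makes the two easy to compare.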

The next step is to copy all those commits to the master branch to flatten the history. There's no master branch yet. Let's make one. A branch created with --orphan has no history, but it starts with all the files from the previous branch still added to the index. Kill them with git rm.

git checkout --orphan master
git rm -rf .

To copy the commits, I need to list them first. That is done with the log command:

git log --pretty='%H' --author-date-order --reverse $repos

This command produces a list of all the commit hashes sorted from the oldest to the newest across all the branches I created earlier. The output of this step looks like this:

7d62b1272b4aa37f07eb91bbf46a33609d80155f
a8673683cb13a2040299dcb9c98a6f1fcb110dbd
f3876d3a4900e7f6012efeb0cc06db241b0540d6
7209ecf519475e59494504ca2a75e36ad9ea6ebe
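
To check the ordering on something small, this sketch builds two unrelated branches with interleaved author dates and lists them with the same log options. The branch names, subjects and timestamps are made up, and the commits are created with the commit-tree plumbing command only to keep the demo short.

```shell
#!/bin/bash
set -euo pipefail

work=$(mktemp -d)
cd "$work"
git init -q .
export GIT_AUTHOR_NAME='Demo' GIT_AUTHOR_EMAIL='demo@example.com'
export GIT_COMMITTER_NAME='Demo' GIT_COMMITTER_EMAIL='demo@example.com'

# All demo commits share the empty tree; only dates and parents matter here.
tree=$(git mktree < /dev/null)

# make_commit <subject> <unix-time> [parent]: create a commit, print its hash.
make_commit() {
    local subject=$1 when=$2 parent=${3:-}
    GIT_AUTHOR_DATE="$when +0000" GIT_COMMITTER_DATE="$when +0000" \
        git commit-tree "$tree" ${parent:+-p "$parent"} -m "$subject"
}

# Two unrelated histories whose author dates interleave.
a1=$(make_commit a1 1500000001)
a2=$(make_commit a2 1500000003 "$a1")
b1=$(make_commit b1 1500000002)
b2=$(make_commit b2 1500000004 "$b1")
git update-ref refs/heads/repo-a "$a2"
git update-ref refs/heads/repo-b "$b2"

# Same options as in the post, printing subjects instead of hashes.
order=$(git log --pretty='%s' --author-date-order --reverse repo-a repo-b | paste -sd ' ' -)
echo "$order"
```

The subjects come out oldest first across both branches, which is the order the cherry-pick loop needs.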

Now that I have the list, I iterate and cherry-pick each commit into master:

for i in $(git log --pretty='%H' --author-date-order --reverse $repos); do
    GIT_COMMITTER_DATE=$(git log -1 --pretty='%at' $i) \
        git cherry-pick $i
done

The GIT_COMMITTER_DATE environment variable is again used to reset the commit date to the original creation time, which I get with the log command again like this:

git log -1 --pretty='%at' <COMMIT-HASH>
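
The same idea on a throwaway repo, as a sketch. One deliberate difference from the loop above: the date is read with %ad and --date=raw, which prints the timestamp together with its timezone offset instead of the bare timestamp from %at. The joined branch name and the file are made up.

```shell
#!/bin/bash
set -euo pipefail

work=$(mktemp -d)
cd "$work"
git init -q .
export GIT_AUTHOR_NAME='Demo' GIT_AUTHOR_EMAIL='demo@example.com'
export GIT_COMMITTER_NAME='Demo' GIT_COMMITTER_EMAIL='demo@example.com'

# A commit with an old author date on the initial branch.
echo 'hello' > file.txt
git add file.txt
GIT_AUTHOR_DATE='1425211200 +0000' git commit -q -m 'old work'
source_commit=$(git rev-parse HEAD)

# Start an empty branch, the way the post does for master.
git checkout -q --orphan joined
git rm -rf -q .

# --date=raw prints "<unix time> <timezone>", Git's internal date format.
GIT_COMMITTER_DATE=$(git log -1 --date=raw --pretty='%ad' "$source_commit") \
    git cherry-pick "$source_commit" > /dev/null

dates=$(git log -1 --date=raw --pretty='author=%ad committer=%cd')
echo "$dates"
```

Without the GIT_COMMITTER_DATE override the cherry-picked commit would keep the old author date but get the current time as its committer date.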

After these steps I have a repo with flat history, where each original repo lives in its own subdirectory under .original/. I can use GitHub file history and blame to see all the changes that happened to the original files since their birth. And since Git detects renames, I could just move these files to their new home inside the megarepo and I would still get the history and blame working.
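
On the command line the history across a move is visible with git log --follow. A small sketch with made-up file names, assuming a plain git mv without content changes:

```shell
#!/bin/bash
set -euo pipefail

work=$(mktemp -d)
cd "$work"
git init -q .
export GIT_AUTHOR_NAME='Demo' GIT_AUTHOR_EMAIL='demo@example.com'
export GIT_COMMITTER_NAME='Demo' GIT_COMMITTER_EMAIL='demo@example.com'

# A file that starts its life under .original/, like after the merge.
mkdir -p .original/lastpass
echo 'original content' > .original/lastpass/client.txt
git add .
git commit -q -m 'lastpass: add client'

# Move it to a new home in a separate commit.
mkdir -p src
git mv .original/lastpass/client.txt src/client.txt
git commit -q -m 'move client to its new home'

# --follow continues past the rename; a plain path filter stops at it.
with_follow=$(git log --follow --pretty='%s' -- src/client.txt)
without_follow=$(git log --pretty='%s' -- src/client.txt)

echo "$with_follow"
```

With --follow both commits are listed; without it only the move shows up.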

The only thing left to do is to clean up the repo, delete all the branches I don't need anymore and run the garbage collector to take out the trash.

for repo in $repos; do
    git branch -D $repo
    git remote remove $repo
    git update-ref -d refs/original/refs/heads/$repo
done

git gc --aggressive

The resulting repo lives here. I feel like it was worth the time I put into that. It gave me an opportunity to learn more about the Git low level commands. And now I have a repo that I can browse with ease and don't need to jump between the branches every time I want to check some file history.

The script I used can be found here.

Top comments (9)

Juan Pablo Almonacid • Feb 16 '19

Hi Dmitry! I've enjoyed the post very much, particularly the use of the git filter-branch command. I recall having used it once a while ago to fix a typo in my email address (GIT_COMMITTER_EMAIL) on several commits from a couple of personal repos.
Beyond that, what I'd like to know, because I'm really intrigued, is why you needed to merge all the repos into one repo. Are those libraries related in some way? Have you considered using git submodule to add the repos as submodules of a parent project?
Thanks!

Dmitry Yakimenko • Feb 17 '19

Thanks! I used it a few times before myself and every time I just pasted a code snippet from the docs or stackoverflow and was done with it. This time I decided to dig deeper.

I would like to merge these libraries (not just repos), because they share a bunch of code that for historical reasons got copy-pasted and modified a bunch of times. I would also like to harmonize their API and make them share even more code. Another approach would be to take out the shared part and make it its own library, but I find it too tedious and this library on its own would not be useful to anyone. The submodules approach is not gonna work in this case, because I'm going to move files around and do global refactoring, which wouldn't make sense in each single repo.

Eryk • Mar 23 '20

Hey Dmitry,

Thank you for the great post and script.

I have a question about usage: We are trying to build a single master 'IT' repo that pulls all the tool/script/fix/setup/etc. repos together. For usability, sub-modules and sub-trees would be difficult to implement with our user base. We would also like to monitor the sub-repos so that when they are committed to, we can pull them and update the master 'IT' repo. I can build a daemon or a cron job to meet this need but the question I have would be regarding the state of the repo after merge.

Would I be able to use your script to keep an existing merged repo up-to-date so I could push it to our primary revision control server?

Dmitry Yakimenko • Mar 30 '20 • Edited

I have not tried it, but it should recreate the merged repo exactly the same every time. So if you rerun it with updated repos you should get an updated merged repo with the same checksums. This is theory, though. It's possible that in practice things won't be so smooth. Plus it would redo a bunch of work every time and the merge process would be unnecessarily slow. It's better to modify the script to track the state and only apply new commits to the merged repo.

Sorry about the late reply. Didn't see the email.

Eryk • Mar 30 '20

Thank you for replying.

Harvey Thompson • Feb 16 '19

In the future you might want to check out Reposurgeon - gitlab.com/esr/reposurgeon

However I did almost exactly the same thing using git and scripts - incrementally improving at each step until I was done. Well worth it in the end.

Dmitry Yakimenko • Feb 18 '19

I tried it out and I couldn't figure out how to use it in a reasonable time. I spent more time with it than with this git-only solution and didn't get even half way to solving my problem. I posted my response here: dev.to/detunized/git-fu-reposurgeo...
Thanks for the tip anyway =)