More seriously, though, how git works internally has fascinated me for a while. It's not actually that complicated, but there are two big chunks. I'll cover the abstraction git operates on first, and next week I'll go over how that's stored on disk. Trust me, it'll be easier to understand how it's stored if you understand what's being stored first.
Note that this is not a git tutorial. I'm assuming you have some familiarity with git -- you don't need to know what commits are, but you do need to know what version control is, and
Note: Pointers make an appearance here, as an analogy. If you don't know what they are, there are plenty of incredible resources online to explain them, but that's far beyond the scope of this article. And probably my abilities, too.
I'm gonna break git down into three pieces:
- The repository
- Commits
- Branches
The repository, "repo" for short, is the "container" of git. Git doesn't operate on your entire filesystem, as you probably know. It's limited to just one directory and everything nested inside it. In the root of the repository is the .git folder, which stores all of git's internal files.
You can actually have a repository inside a repository -- if you've ever been told to git clone --recursive something, that's what it relates to. These nested repositories are called submodules. This article is just going over the basic concepts of git, though, so I won't cover them here.
There are some more corner cases, like .gitignore, which tells git which files to leave out of the repo. I highly recommend looking into .gitignores for your git repos if you don't already have them. They might have been autogenerated by your IDE or in place from your template, but you should take a peek, learn the syntax, and see if you can't make them better. There's usually some stuff, like build output, that can safely be kept out of your repo.
A commit is a snapshot of the state of the repository. It stores the contents of every file as they are at the moment of a commit -- this is why .git folders can sometimes get so big. Each commit is basically a tarball of the entire repository! Git doesn't use the tar format internally; it uses a format that's way more space-efficient and doesn't store identical data repeatedly, but that's a digression for next week.
If you've used git a lot, you might notice that's not how they're typically displayed, though. They're normally displayed as diffs. And when you do things like merging (which we'll get to in a minute, I promise) you merge sets of differences, not two entire files. So... how does that work?
Don't worry. We'll get to it.
The snapshot of your directory is accompanied by some metadata. The most important, in my opinion, are:
- Commit message: This tells people about your commit, and is vital to making your code's history easy to understand. Write good commit messages and you'll inevitably thank your past self.
- Author: The person who wrote the code. Almost definitely you, though some projects do accept patches and change the author to reflect the actual source of the code.
- Parents: The commit (or, occasionally, commit*s*) which this commit follows. This concept will be explored more in the Branches section.
These metadata (along with the rest that git stores) are bundled together with the 'tarball' into a "commit object".
Commits don't quite have names. What they have is a hash. Today, that's generated with SHA-1, though it's expected that git will start to migrate to SHA-256... sometime soon. Whatever the algorithm, it's applied to the commit object, and the result is the commit's "name".
The name, with SHA-1, looks something like this: e9f8ebe40fadf3a644f92c4a5e4af70f92d29347. That's usually condensed down to the first 7 or 8 characters, so e9f8ebe instead, though you can use any number of characters, so long as it's at least 4 and enough to uniquely identify the commit's hash.
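To make the "name is a hash of the commit object" idea concrete, here's a minimal sketch. It is a toy model, not git's real object format: the Commit class, its JSON serialization, and the resolve helper are all made up for illustration. The only thing it shares with git is the principle that the name is a SHA-1 of the object's contents, and that a prefix works as long as it's unambiguous.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    """Toy commit object: a full snapshot plus metadata (NOT git's real format)."""
    snapshot: tuple   # (path, contents) pairs for every file at commit time
    message: str
    author: str
    parents: tuple    # names (hashes) of the parent commits

    def name(self) -> str:
        # Serialize the whole object deterministically, then hash it with SHA-1.
        payload = json.dumps(
            [list(self.snapshot), self.message, self.author, list(self.parents)]
        ).encode("utf-8")
        return hashlib.sha1(payload).hexdigest()


def resolve(prefix: str, names: list) -> str:
    """Expand an abbreviated name, provided it matches exactly one commit."""
    matches = [n for n in names if n.startswith(prefix)]
    if len(matches) != 1:
        raise ValueError(f"{prefix!r} matches {len(matches)} commits")
    return matches[0]


root = Commit((("README", "hello\n"),), "Initial commit", "me", ())
child = Commit((("README", "hello, world\n"),), "Expand README", "me", (root.name(),))
names = [root.name(), child.name()]
```

Because the parent's name is part of what gets hashed, changing anything in an old commit would change the name of every commit after it.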
You might or might not know about git stash. It's really helpful, and I love it a lot. It lets you save your uncommitted changes without making any changes to history, and easily reapply them later. A stash is basically a commit object, but not pointed to by a branch (we'll get into that in a second). When you pop what's on your stash, it gets merged in basically like a branch (again, in a second).
Branches are interesting. In a sense, they don't actually exist. They're just pointers (told you they'd crop up!) to commit objects. When you add a new commit to a branch, the commit's parent is set to what the branch is currently pointing to, and the branch is set to point to the commit you just made.
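Here's a small sketch of that bookkeeping, assuming a toy Repo class of my own invention (the names c1, c2, etc. stand in for real hashes). It only models the two steps described above: the new commit's parent is the old tip, and the branch pointer moves to the new commit.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Commit:
    name: str
    parents: tuple   # names of parent commits


class Repo:
    """Toy repo: commits keyed by name, branches as plain name -> commit pointers."""

    def __init__(self):
        self.commits = {}
        self.branches = {}
        self._ids = count(1)

    def commit(self, branch):
        tip = self.branches.get(branch)
        # The new commit's parent is whatever the branch points to right now...
        parents = (tip,) if tip is not None else ()
        new = Commit(f"c{next(self._ids)}", parents)
        self.commits[new.name] = new
        # ...and then the branch is moved to point at the new commit.
        self.branches[branch] = new.name
        return new.name

    def history(self, branch):
        """The 'branch' as people think of it: the tip plus its chain of first parents."""
        out = []
        name = self.branches[branch]
        while name is not None:
            out.append(name)
            parents = self.commits[name].parents
            name = parents[0] if parents else None
        return out


repo = Repo()
repo.commit("main")
repo.commit("main")
repo.branches["feature"] = repo.branches["main"]  # a new branch is just a copied pointer
repo.commit("feature")
```

Note that creating the feature branch didn't create or copy any commits. It's one extra dictionary entry.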
The "branch", then, is just the parents, grandparents, etc. of the commit that the branch is pointing to. This is, as you may guess, somewhat confusing to define. Where does a branch start? When it 'splits off' from another? How can you tell that the branch you're looking at split off, and not the other?
Good question. A very good question. Such a good one, in fact, that I don't actually know the answer -- it seems to be a mix of convention and just sort of letting things work themselves out. By all means, if you can shed some light on this, please comment.
Tags and HEAD are the same kind of thing as branches: pointers to a commit. The difference is that tags never move, even when you commit -- they're fixed pointers to a single commit object in the tree.
HEAD, on the other hand, moves a lot. When you commit, it moves to point to that new commit object. When you git checkout another branch, it moves to point to the same place that branch is pointing (...mostly). You can even git checkout individual commits or tags, and your HEAD will point to those commits.
As a side note, the commits branches are pointing to are often called the heads (lowercase) of those branches. I personally prefer the term "tips", because it's less likely to get confused with HEAD, but calling them "branch heads" is much more common.
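The "(...mostly)" is worth a quick sketch. When you have a branch checked out, HEAD normally points at the branch itself rather than directly at a commit, which is why committing moves the branch along with it. Check out a tag or a bare commit and HEAD points straight at the commit instead (a "detached HEAD"). The Refs class below is a toy of my own, not how git stores any of this.

```python
class Refs:
    """Toy model of branches, tags, and HEAD as pointers."""

    def __init__(self, first_commit):
        self.branches = {"main": first_commit}
        self.tags = {}
        # HEAD is either ("branch", branch_name) or ("commit", commit_name).
        self.head = ("branch", "main")

    def head_commit(self):
        kind, target = self.head
        return self.branches[target] if kind == "branch" else target

    def commit(self, new_commit):
        kind, target = self.head
        if kind == "branch":
            # HEAD follows the branch, so moving the branch moves HEAD too.
            self.branches[target] = new_commit
        else:
            # Detached: only HEAD moves, no branch is touched.
            self.head = ("commit", new_commit)

    def checkout(self, ref):
        if ref in self.branches:
            self.head = ("branch", ref)
        else:
            # A tag or a raw commit name: point HEAD directly at the commit.
            self.head = ("commit", self.tags.get(ref, ref))


refs = Refs("c1")
refs.tags["v1.0"] = "c1"
refs.commit("c2")          # main moves to c2, the tag stays on c1
on_main = refs.head_commit()
refs.checkout("v1.0")      # detached HEAD at c1
refs.commit("c3")          # HEAD moves to c3, main is left alone
```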
I did promise I'd get to merges eventually. How do they work, if commits are just tarballs of files?
Well, it has to do with one key invariant of a git repo that I've kinda glossed over up until this point: The root commit.
In a repo, every commit, if you follow the chain of parents back far enough, points to the same, original commit. (Git does technically allow a repo to have more than one root, but that's rare enough to ignore here.) Because of that, if you start at two disparate commits in the repo and follow their parents back far enough, you will eventually find a common ancestor. It might well be the root commit itself, but there'll be a commit that both commits came from.
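Finding that common ancestor can be sketched as a graph walk. This is a deliberately naive version, assuming the history is given as a plain dict from commit name to parent names (my own representation). Git's real implementation is far more efficient, but the idea is the same: collect everything reachable from both commits, then keep the common ancestors that aren't themselves ancestors of another common ancestor.

```python
def ancestors(parents, start):
    """Every commit reachable from `start` by following parents, including `start`."""
    seen = set()
    stack = [start]
    while stack:
        name = stack.pop()
        if name not in seen:
            seen.add(name)
            stack.extend(parents[name])
    return seen


def merge_bases(parents, a, b):
    """Common ancestors of a and b that are not ancestors of another common ancestor."""
    common = ancestors(parents, a) & ancestors(parents, b)
    return sorted(
        c for c in common
        if not any(c != other and c in ancestors(parents, other) for other in common)
    )


# root -> a1 -> a2   (main)
#           \-> b1 -> b2   (feature)
parents = {
    "root": [],
    "a1": ["root"],
    "a2": ["a1"],
    "b1": ["a1"],
    "b2": ["b1"],
}
bases = merge_bases(parents, "a2", "b2")
```

Here both root and a1 are common ancestors, but a1 is the useful one, since it's the most recent point where the two histories agreed.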
Once you have that commit, you look at the state of the filesystem in it, compare it to the filesystem in the two commits being merged, and get the differences. If only one commit touched a given area, then you can merge those changes in 'safely'. If both commits touched an area, but made the same changes -- say, a commit was cherry-picked between them to apply a security fix -- then you just apply those changes. If they made different changes to the same bit, then you flag it as a merge conflict and ask a human to handle it.
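Those three rules fit in a few lines if you simplify "area" down to "whole file". That's a big simplification, and it's mine: real merges work on regions within files. The function below takes three snapshots as dicts of path to contents (a made-up representation) and applies exactly the rules above.

```python
def three_way_merge(base, ours, theirs):
    """Merge two snapshots against their common base, one whole file at a time.

    Each snapshot is a dict of path -> contents; a missing path means the file
    doesn't exist in that snapshot. Returns (merged_snapshot, conflicting_paths).
    """
    merged, conflicts = {}, []
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:
            result = o               # both sides agree (same change, or no change)
        elif o == b:
            result = t               # only their side touched it
        elif t == b:
            result = o               # only our side touched it
        else:
            conflicts.append(path)   # different changes to the same file: ask a human
            continue
        if result is not None:       # None means the file was deleted
            merged[path] = result
    return merged, conflicts


base = {"app.py": "v1", "auth.py": "v1", "README": "v1"}
ours = {"app.py": "v2", "auth.py": "fixed", "README": "our wording"}
theirs = {"app.py": "v1", "auth.py": "fixed", "README": "their wording", "new.py": "v1"}
merged, conflicts = three_way_merge(base, ours, theirs)
```

In this example app.py comes from our side, new.py from theirs, auth.py has the same fix on both sides, and README is the conflict.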
That's obviously just an overview, and there are multiple merge algorithms with different strengths and weaknesses. They all operate fundamentally the same, but the actual algorithms used to find the diff, decide what counts as a merge conflict, etc. are different.
Once all that is done, a new commit is made, with the changes from both branches integrated into the new snapshot. Generally, you're merging from one branch onto another, and the branch you're merging onto is the one whose tip advances. The new commit has two parents: The two commits that the branches pointed to when you started the merge.
There's actually a different type of merge, which you may have heard of, called a squashed merge. It's just like a normal merge, but rather than retaining the "from" branch as a parent, it only has the changes, all "squashed together" into a single commit. This is generally used when a very compact, very clean commit history is desired, but it loses the history which can be so useful in tracking down when a regression was introduced.
So what's a rebase then?
It's actually pretty dissimilar to a merge, even though they have the same end result of combining two sets of changes. While a merge leaves the branches as they are and adds a new commit with the new files, a rebase keeps the same number of commits, and moves them around.
When you rebase, you again need to pick a branch to rebase from and a branch to rebase onto. Again, their common base is found, and again, diffs are computed.
This time, though, the diffs are computed for every commit on the "from" branch, starting at the commit just ahead of the common base and ending at the branch tip. Each of those diffs is then applied, in that order, on top of the tip of the "onto" branch, each as its own new commit. When the rebase is done, the "from" branch is moved to point at the last of those new commits. The "onto" branch itself doesn't move. The old commits aren't deleted right away, but no branch points to them any more, so git will eventually clean them up.
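A rough sketch of that replay, under a pile of assumptions: commits are a toy dataclass with a single parent and a whole-file snapshot dict, the common base is passed in rather than computed, and conflicts are ignored entirely. None of the names here are git's. In this sketch, as in git rebase itself, it's the rebased branch's pointer that ends up on the new commits, while the branch it was rebased onto stays where it is.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Commit:
    name: str
    parent: Optional[str]
    snapshot: dict   # path -> contents


def diff(old, new):
    """Per-file changes turning snapshot `old` into `new` (None means 'deleted')."""
    return {p: new.get(p) for p in set(old) | set(new) if old.get(p) != new.get(p)}


def apply(snapshot, changes):
    result = dict(snapshot)
    for path, contents in changes.items():
        if contents is None:
            result.pop(path, None)
        else:
            result[path] = contents
    return result


def rebase(commits, branches, branch, onto, base):
    # Collect the branch's own commits, newest first, stopping at the common base.
    chain = []
    name = branches[branch]
    while name != base:
        chain.append(commits[name])
        name = commits[name].parent
    # Replay each commit's diff, oldest first, on top of the "onto" tip.
    new_tip = branches[onto]
    for old in reversed(chain):
        changes = diff(commits[old.parent].snapshot, old.snapshot)
        new = Commit(old.name + "'", new_tip, apply(commits[new_tip].snapshot, changes))
        commits[new.name] = new
        new_tip = new.name
    # Only the rebased branch moves; the old commits are left unreferenced.
    branches[branch] = new_tip


commits = {
    "root": Commit("root", None, {"a.txt": "1"}),
    "m1": Commit("m1", "root", {"a.txt": "1", "m.txt": "main"}),
    "f1": Commit("f1", "root", {"a.txt": "1", "f.txt": "x"}),
    "f2": Commit("f2", "f1", {"a.txt": "1", "f.txt": "y"}),
}
branches = {"main": "m1", "feature": "f2"}
rebase(commits, branches, "feature", "main", "root")
```

After this runs, feature points at a brand-new f2' whose history goes f2', f1', m1, root. The originals f1 and f2 still exist in the dict, but nothing points to them.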
All in all, not very complex, and nowhere near as scary as it might seem.
You may wonder -- if there's no marker saying when a branch ends, how does deleting a branch work?
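I refer you back to the beginning of this section on branches, where I talk about how to determine what the root of a branch is. Since a branch is just a pointer, deleting one only removes the pointer. The commits themselves stay put as long as some other branch or tag can still reach them.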
So, in short:
- Repos are folders with some git metadata in them.
- Commits are basically tarballs of the whole repo, plus some extra metadata.
- Branches are really just pointers and you shouldn't think about that too much.
Hope you enjoyed! If you have questions, feel free to ask. This is a generic overview and probably didn't hit everything you might be curious about.