If you're like me and have fewer than fifteen years of software engineering experience, the thought of a world without Git doesn't seem possible. When I started to research for this post, I almost fell out of my chair when I read that Git was created in 2005. It doesn't seem that long ago ... either that, or I'm simply getting old. :) Regardless, I often find myself being scared of certain Git commands. Do I `rebase`, or do I `merge`? What is the use case for a `force push`? There have definitely been a few occasions when a wrong Git command turned into a big deal. So, I decided to bite the bullet and learn what is going on under that magical hood.
Git is a distributed version control system, which means that every developer works in a complete local repository, history included, usually alongside a shared central repo on a server. Before distributed systems, Subversion (SVN) was a popular way to manage code version control. Unlike Git, it is centralized rather than distributed. With SVN, your data is stored on a central server, and any time you check it out, you're checking out a single version of the repository.
While most of us remember Git as the first distributed version control system, before Git, there was BitKeeper, a proprietary source control management system. Created in 1998, BitKeeper was spun up to solve some of the growing pains of Linux. It offered a free license for open-source projects, with the stipulation that developers could not create a competing tool while using BitKeeper or for one year afterward. I'm sure you can guess what happened. In the early-to-mid 2000s, there was a plethora of license complaints, and in 2005, the free version of BitKeeper was removed. This prompted Linus Torvalds to create Git, which he named after a British slang word that means "unpleasant person." Linus Torvalds turned the project over to Junio Hamano (a major contributor) after its original v0.99 release, and Junio remains the core maintainer of the project. Fun Fact: As of this writing, the most recent version of Git is 2.28, released on July 27th, 2020.
If you want to read more about BitKeeper, check out its Wikipedia page. BitKeeper itself is no longer being developed.
While Git has morphed into a full-fledged version control management system, this wasn't the original intent. Linus Torvalds said the following on this topic:
In many ways, you can just see Git as a filesystem -- it's content-addressable, and it has a notion of versioning, but I really designed it coming at the problem from the viewpoint of a filesystem person (hey, kernels is what I do), and I actually have zero interest in creating a traditional SCM (source control management) system.
Side note: In case you're wondering what "content-addressable" means, it is a way to store information, so it can be retrieved based on content rather than location. Most traditional local and networked storage devices are location addressed.
Git has two data structures:
- a mutable index (i.e., a connection point between the object database and the working tree) and
- an immutable, append-only object database.
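To make the split between the two structures a bit more concrete, here is a minimal sketch that stages a file in a throwaway repository and then looks at the index. The temporary directory and the file name are just illustrative choices on my part.

```shell
# Work in a throwaway repository so nothing real is touched.
repo="$(mktemp -d)"
cd "$repo"
git init -q

echo 'this is me' > myfile.txt

# Before staging, the index has no entries.
staged_before="$(git ls-files --stage)"

# `git add` writes a blob to the object database and records it in the index.
git add myfile.txt

# Each index entry lists: mode, blob hash, stage number, and path.
staged_after="$(git ls-files --stage)"
echo "$staged_after"

# The hash recorded in the index names an object in the object database.
blob="$(echo "$staged_after" | awk '{print $2}')"
git cat-file -t "$blob"    # prints the type of that object
```

The index entry can be replaced the next time you run `git add`, while the blob it pointed to stays in the object database untouched.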
There are four types of objects, plus a container format for storing them compactly:
- blob: this is the content of a file.
- tree: this is the equivalent of a directory.
- commit: this links tree objects together to form a history.
- tag: this is a container that holds a ref to another object, as well as other metadata.
- packfile: this is a zlib-compressed collection of other objects, rather than an object type of its own.
Each object has a unique name, which is a SHA-1 hash of its contents.
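One detail worth knowing is that the hashed "contents" include a small header. For a blob, Git hashes the word `blob`, a space, the size in bytes, a NUL byte, and then the file content. Here is a rough sketch that recomputes a blob's name by hand and compares it with what Git reports. It assumes the `sha1sum` utility is installed (on macOS, `shasum` plays the same role) and that the repository uses Git's default SHA-1 object format.

```shell
# Work in a throwaway repository so nothing real is touched.
repo="$(mktemp -d)"
cd "$repo"
git init -q

echo 'this is me' > myfile.txt

# What Git reports for the file's blob (no -w, so nothing is written).
git_hash="$(git hash-object myfile.txt)"

# Recompute by hand: header "blob <size>", a NUL byte, then the content.
size="$(wc -c < myfile.txt | tr -d ' ')"
manual_hash="$( { printf 'blob %s\0' "$size"; cat myfile.txt; } | sha1sum | cut -d ' ' -f 1)"

echo "git:    $git_hash"
echo "manual: $manual_hash"
```

Both lines should show the same 40-character value, which is why two files with identical content always end up as the same blob.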
To better understand how all of this fits together, let's create a dummy project directory and initialize a repository in it. Open your terminal, create a new directory, move into it, and run `git init`. You should then see something similar to the following output:
➜ Documents mkdir understanding-git
➜ understanding-git git init
Initialized empty Git repository in /Users/juliekent/Documents/understanding-git/.git/
➜ understanding-git git:(master)
I am sure you have done this many times but may not have really cared to know what was actually in the newly created `.git` directory. Let's check it out. If you run `ls -a` via your terminal, you will see the `.git` directory. By default, it is a hidden directory, which is why you need the `-a` flag. Use `cd .git` to move into the directory, and then run `ls`. You should see something like this:
➜ .git git:(master) ls
HEAD config description hooks info objects refs
We will be focusing on the `HEAD` file and the `objects` and `refs` directories. We will also run some commands so that we have an `index` file, but this will come later. The `description` file is only used by the GitWeb program. The `config` file is pretty straightforward, as it contains project configuration options. The `info` directory keeps an exclude file for ignore patterns that you don't want to track in a `.gitignore` file; I'm sure most of you are familiar with `.gitignore`.
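If you'd like to see the exclude file in action, here is a small sketch. The `*.log` pattern and the file names are made up for the example; the point is that patterns in `.git/info/exclude` apply only to your own clone and are never committed.

```shell
# Work in a throwaway repository so nothing real is touched.
repo="$(mktemp -d)"
cd "$repo"
git init -q

# Ignore *.log files for this clone only.
mkdir -p .git/info
echo '*.log' >> .git/info/exclude
touch debug.log notes.txt

# check-ignore -v reports which file and pattern caused a path to be ignored.
git check-ignore -v debug.log

# Only the untracked file that is not ignored shows up in the status.
git status --porcelain
```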
Let's start with the `objects` directory. To see what is created, go back to the project root and run `find .git/objects`. You should see the following:
➜ understanding-git git:(master) find .git/objects
.git/objects
.git/objects/pack
.git/objects/info
Next, let's create a file:
echo 'this is me' > myfile.txt
This creates a file named `myfile.txt` containing the text `this is me`.
Now, let's run the command `git hash-object -w myfile.txt`. The `-w` flag tells Git to write the object to its database rather than only print the hash.
Your output should be a 40-character string of numbers and letters; this is a SHA-1 checksum hash. If you're not familiar with SHA-1, it is a cryptographic hash function that maps any input to a 160-bit digest.
Next, copy your SHA-1, and run the following command:
git cat-file -p (insert your SHA here)
You should see "this is me", the contents of your file that was created. Cool! This is how content-addressable Git objects work; you can think of it as a key-value store where the key is the SHA-1, and the value is the contents.
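The key-value idea can be sketched directly with the same two commands. In this example I feed the content through standard input with `--stdin` instead of a file, which is just a convenience for the demo.

```shell
# Work in a throwaway repository so nothing real is touched.
repo="$(mktemp -d)"
cd "$repo"
git init -q

# "Put": store a value read from stdin and get its key (the SHA-1) back.
key="$(echo 'this is me' | git hash-object -w --stdin)"

# "Get": look the value up again by key.
value="$(git cat-file -p "$key")"
echo "$key -> $value"

# On disk, the object sits in a directory named after the first two
# characters of the key, in a file named after the remaining 38.
dir="$(echo "$key" | cut -c 1-2)"
file="$(echo "$key" | cut -c 3-)"
ls ".git/objects/$dir/$file"

# Storing the same content again yields the same key; nothing new is written.
again="$(echo 'this is me' | git hash-object -w --stdin)"
```

Because the key is derived from the value, there is no way to store two different values under the same key, and identical content is only ever stored once.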
Let's write some new content to our original file:
echo 'this is not me' > myfile.txt
Then, run the `hash-object` command again:
git hash-object -w myfile.txt
You now have two unique SHA-1s, one for each version of this file. If you want further proof, run `find .git/objects -type f`, and you should see both via your terminal window.
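Since both versions live in the object database, you can pull the older one back out even though the working directory only holds the newer one. Here is a minimal sketch of that round trip, run in a throwaway repository.

```shell
# Work in a throwaway repository so nothing real is touched.
repo="$(mktemp -d)"
cd "$repo"
git init -q

echo 'this is me' > myfile.txt
first="$(git hash-object -w myfile.txt)"

echo 'this is not me' > myfile.txt
second="$(git hash-object -w myfile.txt)"

# Both versions are stored as separate objects.
find .git/objects -type f

# Bring the first version back by reading it out of the database.
git cat-file -p "$first" > myfile.txt
cat myfile.txt
```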
If you'd like to learn more about how other objects in Git work, I recommend following this tutorial.
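As a quick taste of those other objects, here is a sketch that makes one commit in a throwaway repository and then walks from the commit to its tree to the blob. The name and email passed to `git commit` are placeholders so the example doesn't depend on your global config.

```shell
# Work in a throwaway repository so nothing real is touched.
repo="$(mktemp -d)"
cd "$repo"
git init -q

echo 'this is me' > myfile.txt
git add myfile.txt
git -c user.name='Example' -c user.email='example@example.com' \
    commit -q -m 'first commit'

# Resolve the commit, the tree it points to, and the blob inside that tree.
commit="$(git rev-parse HEAD)"
tree="$(git rev-parse 'HEAD^{tree}')"
blob="$(git rev-parse HEAD:myfile.txt)"

git cat-file -t "$commit"   # type of the commit object
git cat-file -p "$commit"   # shows the tree hash, author, committer, message
git cat-file -p "$tree"     # lists mode, type, hash, and name for each entry
git cat-file -p "$blob"     # the file content itself
```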
Let's move on to refs. When running `find .git/refs`, you should see the following output:
➜ understanding-git git:(master) ✗ find .git/refs
.git/refs
.git/refs/heads
.git/refs/tags
As we saw in the previous section about objects, Git creates a unique SHA-1 hash for each one. Of course, we could run all of our Git commands using each object's hash, for example `git show 123abcd`, but this is unreasonable and would require us to remember the hash of every object.
Refs to the rescue! A reference is simply a file stored in `.git/refs` containing the hash of a commit object. Let's go ahead and commit our `myfile.txt`, so we can better understand how refs work. Go ahead and run `git add myfile.txt` and `git commit -m 'first commit'`. You should see something like this:
➜ understanding-git git:(master) ✗ git add myfile.txt
➜ understanding-git git:(master) ✗ git commit -m 'first commit'
[master (root-commit) 40235ba] first commit
 1 file changed, 1 insertion(+)
 create mode 100644 myfile.txt
Now, let's navigate to the `.git/refs/heads` directory by running `cd .git/refs/heads`. From there, run `cat master`. You should see the SHA-1 of the commit you just made. Finally, run `git log -1 master`, which should output something similar to the following:
commit Unique SHA-1 (HEAD -> master)
Author: Julie <firstname.lastname@example.org>
Date: Mon Aug 3 15:59:59 2020 -0500

    first commit
Cool! As you can see, branches are simply references. When the master branch moves to a new commit, all Git has to do is change the contents of the `refs/heads/master` file. Likewise, creating a new branch creates a new reference file with the commit hash.
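Here is a small sketch of both claims in a throwaway repository. The branch name `feature` is made up for the example, and I'm assuming Git's default storage where each new ref is a plain file (Git can also pack refs or use other backends, in which case these files may not exist).

```shell
# Work in a throwaway repository so nothing real is touched.
repo="$(mktemp -d)"
cd "$repo"
git init -q
# Name the initial branch master to match this post; newer Git setups may
# default to a different name.
git symbolic-ref HEAD refs/heads/master

echo 'this is me' > myfile.txt
git add myfile.txt
git -c user.name='Example' -c user.email='example@example.com' \
    commit -q -m 'first commit'

# Creating a branch adds one small file under .git/refs/heads.
git branch feature
ls .git/refs/heads

# Both files hold the same commit hash because both branches point at it.
cat .git/refs/heads/master
cat .git/refs/heads/feature

# A second commit on master rewrites only master's ref file.
echo 'this is not me' > myfile.txt
git -c user.name='Example' -c user.email='example@example.com' \
    commit -q -am 'second commit'
cat .git/refs/heads/master
```

This is a big part of why branching in Git is so cheap: a branch is one short file, not a copy of your project.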
Helpful hint: If you ever want to see all references along with the hashes they point to, run `git show-ref`.
`HEAD` is a symbolic reference. You might wonder, when running `git branch <branch>`, how Git knows the SHA-1 of the last commit. Well, the `HEAD` file is usually a symbolic reference to your current branch. You might be thinking to yourself, "You keep saying symbolic; what does that mean?" Great question! Symbolic means that it contains a pointer to another reference. If your head is spinning, I'm with you. It took me quite a bit of Googling and reading to finally understand what exactly `HEAD` is. Here is a great analogy, pulled from this website:
A good analogy would be a record player and the playback and record keys on it as the HEAD. As the audio starts recording, the tape moves ahead, moving past the head by recording onto it. The stop button stops the recording while still pointing to the point it last recorded, and the point that record head stopped is where it will continue to record again when Record is pressed again. If we move around, the head pointer moves to different places; however, when Record is pressed again, it starts recording from the point the head was pointing to when Record was pressed.
Go ahead and run `cat .git/HEAD` from the project root, and you should see something like this:
➜ understanding-git git:(master) cat .git/HEAD
ref: refs/heads/master
This makes sense because we are on the master branch. HEAD is, essentially, always going to be the reference to the last commit in the currently checked-out branch.
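To watch `HEAD` change, here is a sketch in a throwaway repository. The `feature` branch is a made-up name, and the last step shows the one case where `HEAD` is not symbolic: when you check out a commit directly (a "detached HEAD"), the file holds the hash itself.

```shell
# Work in a throwaway repository so nothing real is touched.
repo="$(mktemp -d)"
cd "$repo"
git init -q
# Name the initial branch master to match this post; newer Git setups may
# default to a different name.
git symbolic-ref HEAD refs/heads/master

echo 'this is me' > myfile.txt
git add myfile.txt
git -c user.name='Example' -c user.email='example@example.com' \
    commit -q -m 'first commit'

# On a branch, HEAD stores the name of a ref rather than a hash.
cat .git/HEAD
git symbolic-ref HEAD      # the ref HEAD points to
git rev-parse HEAD         # the commit that ref currently resolves to

# Switching branches rewrites .git/HEAD to name the other ref.
git checkout -q -b feature
head_on_feature="$(cat .git/HEAD)"
echo "$head_on_feature"

# Checking out a commit directly stores the hash itself in HEAD.
git checkout -q --detach master
head_detached="$(cat .git/HEAD)"
echo "$head_detached"
```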
Helpful Tip: You can run `git diff HEAD` to view the difference between `HEAD` and the working directory.
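The difference from a plain `git diff` is easiest to see with a staged change. In this sketch (again in a throwaway repository, with placeholder identity values), the change has been added to the index, so `git diff` has nothing to report while `git diff HEAD` still shows it.

```shell
# Work in a throwaway repository so nothing real is touched.
repo="$(mktemp -d)"
cd "$repo"
git init -q

echo 'this is me' > myfile.txt
git add myfile.txt
git -c user.name='Example' -c user.email='example@example.com' \
    commit -q -m 'first commit'

# Change the file and stage the change.
echo 'this is not me' > myfile.txt
git add myfile.txt

# Plain `git diff` compares the working directory with the index,
# so a fully staged change produces no output.
unstaged="$(git diff)"

# `git diff HEAD` compares the working directory with the last commit,
# so the staged change appears.
since_head="$(git diff HEAD)"
echo "$since_head"
```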
We have covered a lot in this post! We've learned a bit of fun history regarding how Git came about and examined the main plumbing that makes all of the magic happen! If you want to continue to dive deeper into Git, as well as better understand how some of the common commands work, I highly recommend the book titled "Pro Git", which is available for free online.