Git is one of the most useful tools a software developer can learn. It is a version control system used by developers to keep track of changes made to a project's codebase. It allows multiple people to collaborate on the same codebase without overwriting each other's work.
But what’s even more interesting is how Git manages to keep every version of the code and can still show what changed, down to an individual line. In this article we will take a look at how Git actually stores those versions.
The .git folder
Whenever you initialise a project with git init, a hidden folder called .git is created. This is where Git stores the repository's configuration and the information about every version of the project. Let’s initialise a folder with Git and see the contents.
❯ git init
These are the contents of the .git folder:
- config contains the repository-level configuration, such as remote URLs and, if set for this repository, the user name and email. Git updates this file whenever a setting is changed with git config.
- description contains the description of the repository.
- HEAD maintains the reference to the current branch. In a fresh repository this is the default branch, typically master or main.
- refs stores references to all the branches and tags.
- objects stores the data of the Git objects, which hold the contents of all the files checked in, the trees, the commits and so on.
- hooks contains scripts that Git can run before or after events such as a commit or a push.
- info contains additional information about the repository, such as the exclude file for ignore patterns that are not shared.
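To make the link between HEAD and refs more concrete, here is a minimal Python sketch. It assumes the common case where HEAD holds a line of the form "ref: refs/heads/<branch>" and the branch is stored as a loose file under refs (Git may also keep refs in a packed-refs file, which this sketch ignores). The function name resolve_head and the placeholder hash are made up for illustration, and the script builds a stand-in .git folder so that it runs without Git installed.

```python
import tempfile
from pathlib import Path


def resolve_head(git_dir):
    """Return (ref_name, commit_hash) for the HEAD of a repository.

    ref_name is None for a detached HEAD; commit_hash is None when the
    branch has no commits yet or its ref is not stored as a loose file.
    """
    git_dir = Path(git_dir)
    head = (git_dir / "HEAD").read_text().strip()
    if not head.startswith("ref: "):
        # Detached HEAD: the file holds a commit hash directly.
        return None, head
    ref_name = head[len("ref: "):]
    ref_file = git_dir / ref_name
    if not ref_file.exists():
        return ref_name, None
    return ref_name, ref_file.read_text().strip()


# Build a tiny stand-in for a .git folder so the sketch runs without Git.
with tempfile.TemporaryDirectory() as tmp:
    fake_git = Path(tmp) / ".git"
    (fake_git / "refs" / "heads").mkdir(parents=True)
    (fake_git / "HEAD").write_text("ref: refs/heads/master\n")
    before_commit = resolve_head(fake_git)
    # Placeholder hash, not taken from the article's repository.
    (fake_git / "refs" / "heads" / "master").write_text("a" * 40 + "\n")
    after_commit = resolve_head(fake_git)

print(before_commit)
print(after_commit)
```

Before the first commit the branch named in HEAD has no file under refs yet, which is why the first result carries no hash.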
Git’s SHA1 hashing
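Git stores data in the form of objects, and the contents of files are stored as blob objects. Every object in Git is identified by its SHA-1 hash, which is 20 bytes long and is represented by 40 hexadecimal characters. We can generate hashes for any content. For example, the hash Git computes for “Superman” is 5f42cf3e4992beffcd80266227d529427adb7a2d. Note that Git does not hash the raw text alone: it prepends a short header with the object type and size, and echo adds a trailing newline, so the result differs from a plain SHA-1 of the word. The same content always produces the same hash. You can check this for yourself using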
❯ echo "Superman" | git hash-object --stdin
5f42cf3e4992beffcd80266227d529427adb7a2d
If we change the content, we get a completely different hash:
❯ echo "SuperMan" | git hash-object --stdin
3b552a73712ce7111a4aa6a600f19700ae378f7a
This is the basis of what Git does. It detects a change because the changed content produces a different hash each time.
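The hashing step can be reproduced outside Git. The following is a minimal sketch in Python, assuming the blob format of a "blob <size>" header, a NUL byte, and then the raw content; git_blob_hash is a name chosen here for illustration and is not part of Git.

```python
import hashlib


def git_blob_hash(data: bytes) -> str:
    """Hash bytes the way Git hashes a blob: SHA-1 over header + content."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


# echo appends a newline, so the bytes that get hashed are b"Superman\n".
superman = git_blob_hash(b"Superman\n")
super_man = git_blob_hash(b"SuperMan\n")
empty = git_blob_hash(b"")

print(superman)
print(super_man)
print(empty)
```

The third value is the hash Git assigns to empty content, e69de29bb2d1d6434b8b29ae775ad8c2e48c5391. The first two differ from each other even though only one letter changed case.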
A new commit
Now let's add a file called testfile to this repo and commit it.
❯ touch testfile
❯ git add testfile
❯ git commit -m "added test file"
Now notice the abbreviated commit hash in the output: c36ed1c. This identifies the commit object created for this commit. Git maintains the versions using something similar to a file system: it stores file contents as blob objects, directory listings as tree objects, and commits as commit objects. The difference between a blob and a file is that a blob stores only the content, while a file also carries metadata such as its name.
c36ed1c is just the first seven characters of the full hash. If we now go to the .git folder we see the following:
Here you can see the full hash under the folder c3. Git uses the first two characters of the hash as the folder name and the remaining 38 as the file name, so the string starting with c3 and ending with 2b is the full hash of the commit object. But what about the other hashes? We did not create them. Well, we did create them… sort of. Let me explain.
Type the following command in the root of the project.
❯ git cat-file -p c36ed1cabb3565974f8846391f6ed59959f2d02b
This command shows the content of the object with that hash, in this case the commit object.
As you can see, this commit object contains added test file, which was our commit message. It also contains a reference to another hash, c6dc3ef..., which identifies the tree object for this commit.
Again, if we use git cat-file on the tree hash, we get the following
This time it contains an entry for testfile, whose content is stored as the blob with hash e69de29... (the hash of empty content, since testfile is still empty). Thus we now have the commit, the tree with the filename, and the content of the file all stored in Git as hashed objects.
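A tree object can be sketched in Python as well. The code below assumes the tree layout of one entry per file in the form "<mode> <name>", a NUL byte, and the 20 raw bytes of the entry's hash. The helper names are invented for illustration, and the sorting is simplified to plain name order, which is enough for a tree without subdirectories.

```python
import hashlib


def build_tree(entries):
    """Serialise (mode, name, hex_hash) entries into a tree object's body."""
    body = b""
    for mode, name, hex_hash in sorted(entries, key=lambda e: e[1]):
        body += mode.encode() + b" " + name.encode() + b"\0"
        body += bytes.fromhex(hex_hash)  # 20 raw bytes, not 40 hex characters
    return body


def parse_tree(body):
    """Recover (mode, name, hex_hash) entries from a tree object's body."""
    entries, pos = [], 0
    while pos < len(body):
        space = body.index(b" ", pos)
        nul = body.index(b"\0", space)
        mode = body[pos:space].decode()
        name = body[space + 1:nul].decode()
        hex_hash = body[nul + 1:nul + 21].hex()
        entries.append((mode, name, hex_hash))
        pos = nul + 21
    return entries


def object_hash(kind, body):
    """SHA-1 over '<kind> <size>' + NUL, followed by the body."""
    header = kind.encode() + b" " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


EMPTY_BLOB = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"  # blob of an empty file
tree_body = build_tree([("100644", "testfile", EMPTY_BLOB)])

print(parse_tree(tree_body))
print(object_hash("tree", tree_body))
print(object_hash("tree", b""))
```

The last line prints 4b825dc642cb6eb9a060e54bf8d69288fbee4904, the hash Git uses for a tree with no entries. The filename lives only in the tree entry, which is why the blob itself can be shared by every empty file in a repository.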
Making changes
Now let's add some content to testfile: a simple line of text, “this is a test file”. Now when we run git status we see that this file has been modified, because the content in the working directory no longer matches what is recorded in the .git folder. Git hashes the content of each file as a whole. Line-by-line differences are not stored; they are computed on demand by comparing two versions, for example with git diff.
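The comparison behind this check can be sketched in a few lines of Python. This is a simplification: as far as I know, the real git status first compares cached file metadata in the index and only rehashes content when it has to, whereas this sketch always hashes. The names blob_hash and changed_paths are illustrative, not Git APIs.

```python
import hashlib


def blob_hash(data: bytes) -> str:
    """SHA-1 over Git's blob header followed by the content."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


def changed_paths(recorded, working_files):
    """Return paths whose current content no longer matches the recorded hash."""
    return sorted(
        path
        for path, data in working_files.items()
        if recorded.get(path) != blob_hash(data)
    )


# Hashes recorded at the first commit: testfile was empty at that point.
recorded = {"testfile": blob_hash(b"")}

unchanged = changed_paths(recorded, {"testfile": b""})
modified = changed_paths(recorded, {"testfile": b"this is a test file\n"})

print(unchanged)
print(modified)
```

A file that is absent from the recorded mapping is also reported, which loosely corresponds to an untracked file.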
Let's commit this change.
❯ git add testfile
❯ git commit -m "second commit"
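This time three new subdirectories are added under objects: one for the new commit, one for the new tree, and one for the new blob holding the updated content of testfile.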
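Each of these subdirectories holds a "loose" object file. The sketch below shows, under the assumption that loose objects are zlib-compressed and stored at objects/<first two hex characters>/<remaining 38>, how such a file could be written and read back in Python. The function names are made up for illustration, and the script works in a temporary directory rather than a real repository.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path


def write_loose_object(objects_dir, kind, body):
    """Store an object the way loose objects are laid out; return its hash."""
    store = kind.encode() + b" " + str(len(body)).encode() + b"\0" + body
    digest = hashlib.sha1(store).hexdigest()
    path = Path(objects_dir) / digest[:2] / digest[2:]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(store))
    return digest


def read_loose_object(objects_dir, digest):
    """Return (kind, body) for a loose object, like a bare-bones cat-file."""
    path = Path(objects_dir) / digest[:2] / digest[2:]
    store = zlib.decompress(path.read_bytes())
    header, _, body = store.partition(b"\0")
    kind, _, size = header.partition(b" ")
    assert int(size) == len(body), "size in header should match the body"
    return kind.decode(), body


with tempfile.TemporaryDirectory() as tmp:
    objects = Path(tmp) / "objects"
    digest = write_loose_object(objects, "blob", b"this is a test file\n")
    subdirs = sorted(p.name for p in objects.iterdir())
    kind, body = read_loose_object(objects, digest)

print(digest[:2], subdirs)
print(kind, body)
```

Because the file name is derived from the content, writing the same content twice lands on the same path, so identical content is stored only once.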
Again using git cat-file, we can see the contents of these new objects.
This time the commit object has a new property: parent. Git commits form a graph in which each commit is a node that points back to the commit it was built on, so the previous commit becomes the parent of this commit. Therefore the commit f51e8... has c36ed... as its parent. This makes it easy to follow the commits, because their history is represented by the relations between nodes.
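Walking that chain of parents is essentially what a history listing does. Here is a minimal Python sketch, assuming a commit body made of "key value" header lines, a blank line, and then the message. The dictionary stands in for the object store: its keys are the abbreviated hashes from the text used purely as labels, while the tree values and the author are hypothetical placeholders. Multi-line headers such as signatures are not handled.

```python
def parse_commit(body):
    """Split a commit object's text into header fields and the message."""
    header_text, _, message = body.partition("\n\n")
    commit = {"parents": [], "message": message.strip()}
    for line in header_text.split("\n"):
        key, _, value = line.partition(" ")
        if key == "parent":
            commit["parents"].append(value)  # a merge commit has several
        else:
            commit[key] = value
    return commit


def history(store, start):
    """Follow first-parent links from start back to the root commit."""
    order = []
    current = start
    while current is not None:
        commit = parse_commit(store[current])
        order.append((current, commit["message"]))
        current = commit["parents"][0] if commit["parents"] else None
    return order


# Stand-in object store. The keys are the abbreviated hashes from the text,
# used only as labels; the tree values and author are hypothetical.
store = {
    "c36ed": "tree tree-1\nauthor A U Thor <author@example.com> 0 +0000\n\nadded test file\n",
    "f51e8": "tree tree-2\nparent c36ed\nauthor A U Thor <author@example.com> 0 +0000\n\nsecond commit\n",
}

log = history(store, "f51e8")
for name, message in log:
    print(name, message)
```

The walk stops at the first commit because it has no parent line, which is how the root of the history is recognised in this sketch.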
Conclusion
Now we know how Git keeps a record of every version of every file in the project. Git can use these hashes, together with the trees and commits that reference them, to retrieve exactly the content required for any version.
Thanks for reading! 👋
Cover photo by Praveen Thirumurugan
Top comments (2)
What's important I think is to understand Git stores "snapshots" of your working tree, and then computes the diffs when you need to think in terms of diffs (show me what this commit did, rebase that branch, cherry pick that commit, etc.)
Also, this storage as files named by the hashes of their content is no longer entirely accurate, as Git will actually optimize storage by "packing" things into archives (see git-scm.com/docs/git-gc which points to other plumbing commands), so you won't necessarily find the files in .git/objects (but can still access them with git cat-file)
Good info to know. I'm adding this to my notes about git for future reference.