Đặng Đình Sáng

Posted on Dec 6, 2023

Understanding Git And Git Workflow

#beginners #webdev #git

Graph Theory

Graphs are all around us. The social networks that connect us, the infrastructure that powers our cities, and even the circuit diagrams that bring our technology to life are all examples of graphs. But what exactly is graph theory and how can understanding it help us make sense of the connected world?

I first learned about graph theory in one of my college math classes. It was sort of an abstract concept at first - just nodes and edges on a page. Graphs are everywhere.
The subway map in the city was a graph, with stations as nodes and train lines as edges. Your Facebook friends list was a graph. Even the wiring under the hood of your car forms a graph of connections. Graphs are a simple yet powerful way to represent any system made up of interrelated parts.

Formally, a graph is made up of vertices (also called nodes or points) connected by lines or arcs called edges. The vertices can represent people, computers, molecules - anything that can be connected to something else. The edges show the relationships or interactions between those things. A graph gives us a structured way to model relationships visually. Once you have a representation of a system as a graph, you can start to analyze it mathematically.

One of the basic questions graph theorists ask is: How connected is this graph? One way to measure that is through graph invariants like diameter, density, and connectivity. The diameter tells you the longest distance between any two vertices when moving along edges. Density measures how many edges there are relative to the total possible. Connectivity reveals whether the graph is made of one connected component or multiple disconnected subgraphs. Understanding these metrics helps us evaluate things like network resilience, traffic patterns, and disease spread.

Algorithms that operate on graphs also come up frequently in computer science. Shortest path problems, minimum spanning trees, and graph traversals all use graph representations. The algorithms power everything from GPS route planning and social networking recommendations to VLSI circuit design. I find it amazing that many complex real-world issues can often be boiled down to a graph and solved through algorithmic graph operations.

Graphs with Git

Git stores content in a data structure called the Git object database. At its core, this database represents Git's revision history as a directed acyclic graph (DAG) of commits (DAG vs. tree using Git). Each commit in the history is a vertex in the graph, while the parent-child relationships between commits form the edges. For example, when you create a commit, it will reference the commit(s) that came directly before it as its parent(s) - this establishes edges in the commit-graph.

Image source

This commit-graph data structure allows Git to do really powerful things. Since it's a graph, it can represent complex commit histories involving branches and merges as they happen in the real world of development. Traditional version control systems at the time stored histories as linear sequences, limiting how history could evolve. But Git's DAG lets histories branch and converge freely, closely mirroring how codebases are developed and maintained in real-life projects.

Other objects in the Git database like trees, blobs, and tags end up forming secondary graphs connected to the commit-graph. For example, each commit points to a tree object containing its content, while trees point to blob files. So the trees and blobs together give a snapshot of the code tree structure at each commit vertex in the main graph. These connected subgraphs allow traversing between versions of content, authors, timestamps, and more through the graph connections.

Image source

VISUALIZING YOUR GIT REPOSITORY

Source

The main point of using a visualizer is to help you make sense of your branch history. For example, to list all commits in your repository at the command line, you could do git log --oneline --abbrev-commit --all which will get you this flattened view

git log --oneline --abbrev-commit --all --graph

GARBAGE COLLECTION and GIT AMEND

You write some code and check it in. Then, you realize you forgot to run the tests, so you run them, and they uncover a syntax error. Or you spot a typo. For whatever reason, you weren't done when you thought you were done. Think like (a) git

When I just started learning git, I added a new commit for each change but these changes are doing the same feature. The first one would say "Add feature X", and the next several would have messages like "Fix typo" "Fix bug" or "Fix sonarqube". That makes my git graph a mess, and unclear for each commit. And then I didn't know which this commit was for which feature.

Git gives you another option: you can tack the new change onto the previous commit using git commit --amend. This keeps all of your related changes bundled together in one commit, so you can understand it more quickly when you're reviewing it later.

When you amend a commit in Git, it works by creating a brand new commit object that replaces the original. This new commit references the same tree as the old one but has changes like modified metadata or updated commit message. Behind the scenes, Git links the new commit back to the original's parent(s). But it doesn't delete the old commit object right away - that's where GC comes in.

Since Git is designed around its internal object database, it needs some housekeeping to avoid cluttering up disk space with stale, unreferenced objects over time. This is the job of Git's automatic garbage collector. By default, Git runs GC when running other commands if too many loose objects (not packed in a delta-compressed file) exist. It detects unreachable objects that aren't ancestors of any references like branches or tags.

These unreferenced commits from amend operations will be identified as garbage. GC then consolidates them into a packed file to reclaim disk space. This cleaning happens transparently without affecting your work.

source

You can rewrite the commit message with git commit --amend -m "new commit message" or keep the old one git commit --amend --no-edit

GIT CHERRY-PICK

source

A Git commit's ID is a hash of both its contents and its history. So, even if you have two commits that introduce the exact same change, if they point to different parent commits, they'll have different IDs.

What git cherry-pick does is take a commit from somewhere else, and "play it back" wherever you are right now. Because this introduces the same change with a different parent, Git builds a new commit with a different ID

If you were at node H in this graph, and you typed git cherry-pick E you'd wind up with a copy of commit E—let's call it "E prime" or E'—that pointed to H as its parent, like so:

Git Rules

Do works in a feature branch (create feature branches from the develop branch.
Never push into the develop or master branch. Make a Merger Request(Gitlab)/ Pull Request(GitHub).
Update your local develop branch and do an interactive rebase before pushing your feature and making a PR/ MR.
Resolve potential conflicts while rebasing and before making a PR/ MR.
Delete local and remote feature branches after merging (Feature branches should only exist while the work is still in progress, avoid dead branches).
Before making a PR/MR, make sure your feature branch builds successfully and passes all tests in CI/ CD.
Use .gitignore file (system files/env files.. that should not be sent with your code into a remote repository).
Protect master, stage, develop branches (Avoid unexpected and irreversible changes).

Git Workflow

Start with a new project.

cd <path_to_project_folder>
git init
git checkout -b master
git add .
git commit -m "Initial commit"
git remote add origin <git_url/ssh_url>
git push origin master

Check out a new feature branch.

git checkout -b <branch_name>

Add changes.

git add -p
git commit

git add -p will give you a chance to review all of the introduced changes one by one and decide whether to include them in the commit or not. git commit will start an editor that lets you separate the subject from the body.

Sync with the remote repository to get changes of anyone in your team just merged into develop.

git checkout develop
git pull

Update the feature branch with the latest changes from the develop branch by interactive rebase.

git checkout <branch_name>
git rebase -i --autosquash develop

You can use --autosquash to squash all your commits to a single commit/ or not based on your team. For me, 1-4 commits for each feature.

If you don’t have conflicts, just go ahead. If you have, resolve them and continue to rebase.

git add .
git rebase --continue

Push to feature branch after rebasing. Rebase will change history, so you'll have to use -f to force changes into the remote branch.

git push origin <your_current_feature_branch_name> -f

Create a PR/ MR.
PR/MR will be accepted or commented on, merged, and closed by a reviewer. If the reviewer commented, based on it you have to update the code or not then use git amend to append modify to the previous commit or create a new commit.
Remove your local feature branch if you're done.

git branch -D <branch_name>

Remove all branches that are no longer on remote.

git fetch -p && for branch in git branch -vv --no-color | grep ': gone]' | awk '{print $1}'; do git branch -D $branch; done

Conclusion

I hope this discussion gave you some useful insight into the interesting connections between Git, its underlying data model, graph theory concepts, and Git workflow. While version control systems and graph databases may seem like different topics, at their core they are both capturing and querying directed acyclic graphs of information.

Git's revolutionary approach of modeling code history as a graph is what allows it to represent complex project histories flexibly. This relies on graph theory constructs like vertices, edges, and parent-child relationships between commits. Secondary subgraphs and garbage collection keep the object database organized over time.