This week I learned a little bit more about git. The TL;DR: If the standard
git diff command isn't outputting enough detail, try
git diff --word-diff or
git diff --word-diff-regex=<your regex here>
Git is a tool that, for me, falls into the category of constantly learning. Of course, this is true with just about everything related to software engineering, but with git it feels especially so. There are countless workflows, tools, and GUIs; everyones interactions with git are unique. This last week I was having to manually edit a column of data in a CSV. An example from that column looked something like this:
There were a bunch of integers ending in a
G, which I needed to drop. The client had been adding the
G to make the number unique, but with the app we are building for them, we have other ways to ensure uniqueness. If we drop the
G the data becomes cleaner overall, and
JOINing tables on that column becomes far less complicated. On the surface, the task of dropping the appended letter seemed easy enough, but what was causing me problems was confirming that I wasn't affecting any other columns in the fairly large CSV (3000+ rows, 20+ columns). After I ran a find-and-replace to drop the
G from all the data in that column, I ran a basic
git diff on the file, and my output was this:
Basically, the comparison was unusable. Because of how the raw CSV file was being diffed by git, and because I changed an attribute in one field, that whole line had changed as git saw it.
Git was representing my one character change as the entire line of the raw CSV changing.
Being overly confident in my find-and-replace abilities, I added the modified CSV to a commit and pushed up the PR. It didn't take too long for a teammate to add a comment to the PR, showing where I had inadvertently altered the data in another column. His comment meant I had two tasks ahead of me: make my find-and-replace more granular, and find a method to get a more detailed diff between the two files. I knew I could figure out the find-and-replace, but I was a bit perplexed by how detailed his diff was when he included a screenshot in his PR comment.
After a quick Slack discussion, it was literally a "well, duh, of course," moment for me. Of course there would be options you could pass to
git diff, there are almost always options to pass into commands to get the desired output. But for whatever reason, I just was not in that train of thought and I was taking
git-diff at face value.
I'll get to the the actual git options in a minute, but what was surprising to me is that Github displays a more detailed diff than the default
git diff I'm running on my machine. That meant if I were to add the changes as a new commit, push that commit up to the PR, the diff view between those two SHAs would be more granular than what I was getting in my terminal. This is that same file diffed with Github:
Of course, there are ways to get a similar output locally. I went with the option
--word-diff-regex=, documentation can be found here. I still can't come up with regexes out of thin air, but luckily the web is full of them. Within the first few search results I found a solution so elegant it honestly made me laugh out loud. This diff command:
git diff --word-diff-regex=.
Produced this output:
Honestly, I have never come across such simple regex that worked so well! According to the regex helper tool regexer.com the period -or dot-
matches any character except line breaks. So using that as a regex is basically forcing the diff tool output to show the diff to the granularity of each character. For me that was great, but my colleague is always looking for more than great, and he passed along a few other regex options that were as good, if not better. The first was this:
git diff --word-diff-regex="[^,]+"
Which produced this:
And there's also this:
git diff --word-diff=color --word-diff-regex="([A-Z]+|[0-9]+)"
Which produced this output:
I'm not going to dive into each of those regexes, regexer.com will do a much better job of explaining. The takeaway is that you can get very detailed
git diff comparisons, and
--word-diff-regex= is one way to accomplish that. Also, scrolling down the
git diff documentation is a nice reminder that the available options for standard git commands are vast.
I'm happy with the output I got from the git-provided option. It's also worth mentioning that there are a few packages and git config add ons that can help with more granular diffing. These are the three options that came up in conversation with the team around this:
This post is part of an ongoing This Week I Learned series. I welcome any critique, feedback, or suggestions in the comments.