I recently organized my pinned repositories on GitHub and noticed that the language shown for one of them didn't seem quite right. It indicated HTML, which was not what I was expecting.
I did some digging to figure out how GitHub determines the language for a repository and how I can change the language shown.
Once you push changes to a repository on GitHub, Linguist does its thing: a low-priority background job goes through all of the files to determine the language of each one. Some things to note:
- all of the languages it knows about are listed in languages.yml
- excluded files include binary data, vendored code, generated code, documentation, and files in either data (e.g. SQL) or prose (e.g. Markdown) languages; explicit language overrides, on the other hand, are included
To determine the language for each remaining file, Linguist applies the seven strategies listed below, in the order shown. Each strategy either identifies the exact language or reduces the set of candidate languages passed on to the next strategy.
- Vim or Emacs modeline
- commonly used filename
- shell shebang
- file extension
- XML header
- heuristics
- naïve Bayesian classification
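To make the narrowing-down idea concrete, here's a minimal Python sketch of a strategy cascade. It is not Linguist's actual code (Linguist is written in Ruby). The lookup tables, the function names, and the choice of only three strategies are all made up for illustration, and the real data lives in languages.yml.

```python
from typing import Callable, List, Optional

# Hypothetical, tiny lookup tables for illustration only.
FILENAMES = {"Makefile": ["Makefile"], "Dockerfile": ["Dockerfile"]}
INTERPRETERS = {"python": ["Python"], "python3": ["Python"], "bash": ["Shell"], "sh": ["Shell"]}
EXTENSIONS = {".py": ["Python"], ".html": ["HTML"], ".h": ["C", "C++", "Objective-C"]}

# A strategy takes (filename, content, current candidates) and returns possible languages.
Strategy = Callable[[str, str, List[str]], List[str]]


def by_filename(name: str, content: str, candidates: List[str]) -> List[str]:
    """Match well-known filenames such as Makefile."""
    return FILENAMES.get(name, [])


def by_shebang(name: str, content: str, candidates: List[str]) -> List[str]:
    """Look at a '#!' first line and map the interpreter to a language."""
    first = content.splitlines()[0] if content else ""
    if not first.startswith("#!"):
        return []
    parts = first[2:].split()
    if not parts:
        return []
    interpreter = parts[0].rsplit("/", 1)[-1]
    if interpreter == "env" and len(parts) > 1:
        interpreter = parts[1]  # handle '#!/usr/bin/env python3'
    return INTERPRETERS.get(interpreter, [])


def by_extension(name: str, content: str, candidates: List[str]) -> List[str]:
    """Map the file extension to one or more languages."""
    dot = name.rfind(".")
    return EXTENSIONS.get(name[dot:], []) if dot > 0 else []


STRATEGIES: List[Strategy] = [by_filename, by_shebang, by_extension]


def detect(name: str, content: str, strategies: List[Strategy] = STRATEGIES) -> Optional[str]:
    """Run strategies in order; stop at the first one that leaves exactly one language."""
    candidates: List[str] = []
    for strategy in strategies:
        found = strategy(name, content, candidates)
        if candidates:
            # Keep only languages that earlier strategies also considered possible.
            found = [lang for lang in found if lang in candidates]
        if len(found) == 1:
            return found[0]
        if len(found) > 1:
            candidates = found  # narrowed set is handed to the next strategy
    return None  # still ambiguous or unknown after every strategy
```

In this toy version, a file named util.h stays ambiguous because nothing after the extension strategy can break the tie. That tie-breaking is the job the later strategies in the list above would take on.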
The results are then used to produce the language stats bar, which shows the languages that make up the repository and their respective percentages. Each percentage is based on the bytes of code for that language, as reported by the List Languages API. The language shown on each of my pinned repos is the one with the largest share.
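Here's a small Python sketch of that arithmetic. The List Languages API returns a JSON object that maps each language name to a byte count. The function names and the sample numbers below are made up for illustration, and rounding to one decimal place is my own choice.

```python
from typing import Dict, Optional


def language_percentages(byte_counts: Dict[str, int]) -> Dict[str, float]:
    """Turn a {language: bytes} mapping into percentages, largest share first."""
    total = sum(byte_counts.values())
    if total == 0:
        return {}
    ordered = sorted(byte_counts.items(), key=lambda item: item[1], reverse=True)
    return {lang: round(100 * count / total, 1) for lang, count in ordered}


def top_language(byte_counts: Dict[str, int]) -> Optional[str]:
    """Return the language with the most bytes, or None for an empty mapping."""
    if not byte_counts:
        return None
    return max(byte_counts, key=byte_counts.get)


# Made-up byte counts in the same shape as the API response.
sample = {"JavaScript": 3000, "HTML": 6000, "CSS": 1000}
print(language_percentages(sample))  # {'HTML': 60.0, 'JavaScript': 30.0, 'CSS': 10.0}
print(top_language(sample))          # HTML
```

With numbers like these, HTML ends up as the repository language even though most of the hand-written code might be JavaScript, which is the situation I ran into.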
Also, I was today years old when I found out about the language stats bar. If you’re wondering where it is, it’s the colorful bar at the top of your repository, just under the commits/branches/etc. bar. Those colors indicate the languages that make up your repo, and you can click on it to get the full breakdown. 🤯
Now that we know the background of how GitHub determines the repository language, I’ll show you how to change the language shown using a .gitattributes file:
- Create a .gitattributes file at the top level of your repo
- Edit the file and add a line made up of a pattern for the file extension of the language(s) you want ignored, followed by linguist-detectable=false. Since I want HTML ignored, my line is *.html linguist-detectable=false
- Add, commit, and push the changes
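If you'd rather script the file edit, here's a minimal Python sketch. The helper name add_linguist_override is hypothetical. It only appends the override line to .gitattributes, skipping it if the line is already there, and leaves the add/commit/push step to you.

```python
from pathlib import Path


def add_linguist_override(repo_root: str, pattern: str,
                          attribute: str = "linguist-detectable=false") -> bool:
    """Append '<pattern> <attribute>' to .gitattributes in repo_root.

    Returns True if the line was added, False if it was already present.
    """
    path = Path(repo_root) / ".gitattributes"
    line = f"{pattern} {attribute}"
    existing = path.read_text(encoding="utf-8").splitlines() if path.exists() else []
    if line in existing:
        return False  # nothing to do, keep the file untouched
    existing.append(line)
    path.write_text("\n".join(existing) + "\n", encoding="utf-8")
    return True


# Example (assumes you run it from the top level of your repo):
# add_linguist_override(".", "*.html")
```

After that, the usual git add .gitattributes, git commit, and git push finish the job. The stats bar updates once the background job has run again, so don't panic if it takes a moment.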