Vincent A. Cicirello

Posted on Aug 31, 2022

Configuring GitHub's Linguist to Improve Repository Language Reporting

#github #tutorial #programming

In this post, I explain how to configure GitHub's Linguist within your repository to enable more accurate and more relevant repository language reporting, with examples from a few of my own repositories. Every repository on GitHub has a chart that shows the distribution of languages detected in the repository. GitHub's Linguist is responsible for detecting the language of each file within your repository, and the reported percentages are based on file sizes. For example, "Java 50%" means that Java files account for 50% of the total size of all detected files in the repository. There are also third-party tools that display language statistics, such as the user-statistician GitHub Action that I developed and maintain, which generates an SVG that includes (among other things) a pie chart summarizing the language distribution across all of your public repositories (excluding forks). The data for that chart comes from GitHub's GraphQL API, which reports the languages of each of your repositories as detected by Linguist.

For examples of the language charts generated by user-statistician, see my DEV post from last week:

The user-statistician GitHub Action mentioned in Awesome-README

Vincent A. Cicirello ・ Aug 25 '22

Here are a couple of examples from my repositories of the language charts built into every GitHub repository:

GitHub Language Chart From https://github.com/cicirello/InteractiveBinPacking

GitHub Language Chart From https://github.com/cicirello/Chips-n-Salsa

What can you do if the reported languages are not what you expect? The remainder of this post explains how to configure Linguist in your repository for those cases, with examples.

Contents: The rest of this post is organized as follows:

Linguist's Defaults
How to Configure Linguist in Your Repository
Find Out More
Where You Can Find Me

Linguist's Defaults

Linguist automatically excludes a variety of things, including entire categories of languages, but it is possible to override all of its defaults. Linguist classifies each language as one of the following types: programming, markup, data, or prose. You can find how each language is classified in Linguist's languages.yml. By default, Linguist includes only programming languages and markup languages in a repository's language statistics, and excludes data languages and prose languages. An example of a prose language is Markdown. If not for Linguist's default exclusion of all prose languages, nearly every repository would have Markdown in its language chart, since Markdown is so widely used for documenting projects. A few examples of common languages that Linguist classifies as data languages include XML, JSON, YAML, SQL, and GraphQL. So unless you configure Linguist in your repository, all of these, as well as other data languages, will be excluded.

Linguist also excludes files within paths that are commonly used for documentation, such as all files within a docs directory. This is certainly desirable behavior. Imagine that you have a Java project, and that you are serving the javadocs via GitHub Pages from a docs directory in your default branch. If not for excluding documentation, HTML might be identified as a significant percentage of the repository, which would be a bit strange in such an instance.

Linguist also excludes, by default, any code that it detects as either generated or vendored code. Linguist has detailed documentation on each of these categories, along with how you can override its default behavior.
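As a rough sketch of what such overrides can look like, the following lines could go in the .gitattributes file described in the next section. The linguist-vendored and linguist-generated attributes come from Linguist's documentation. The paths are hypothetical, chosen only for illustration, and the example assumes that Linguist treats the vendor directory as vendored code.

```gitattributes
# Hypothetical paths, for illustration only.

# Count code that Linguist would otherwise skip as vendored.
vendor/mylib/** -linguist-vendored

# Treat these files as generated, so they are left out of the statistics.
src/generated/** linguist-generated

# Count a file that Linguist wrongly detects as generated.
src/parser/Tables.java -linguist-generated
```

As with the other attributes in this post, a leading - turns the attribute off for the matching files, and the bare attribute name turns it on.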

How to Configure Linguist in Your Repository

All of Linguist's default behavior can be overridden. Here are some examples of how to do some overrides. The first step is creating a file named .gitattributes at the root of your repository (if you don't already have one for another reason). All configuration takes place in that .gitattributes file.

Misidentified Language

I haven't encountered a case of incorrect language identification yet. But if you do, you can correct it. Perhaps you are using an unusual file extension for a given language. Since I haven't seen this case yet, my example of how to fix it is fake. Let's say you have some reason to use the extension .j for Java. I can't think of a good reason to do this, or even bad reasons for that matter, so don't actually use such an extension. There is no way that Linguist will get this right on its own. But you can direct it to classify such files as Java with:



*.j linguist-language=Java
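A somewhat more realistic sketch, still hypothetical: suppose a C++ project finds that its .h header files are counted as C. The same attribute can reassign them. As far as I understand Linguist's documentation, a language name that contains spaces is written with hyphens in place of the spaces. The .mylisp extension below is made up for the example.

```gitattributes
# Hypothetical: classify all .h headers in this repository as C++.
*.h linguist-language=C++

# Hypothetical extension; spaces in a language name become hyphens.
*.mylisp linguist-language=Common-Lisp
```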

Including A Data Language

As mentioned, Linguist excludes data languages by default, including (among others) XML, JSON, YAML, SQL, and GraphQL. In most cases, you probably do want to exclude these, especially languages like XML, JSON, and YAML that are commonly used for configuration data. One of my projects is the user-statistician GitHub Action. To assist new users setting up workflows to use it, the repository has a directory with Quickstart Workflows, each of which is a YAML file, the language used by GitHub Actions to specify CI/CD workflows. Since YAML is classified as a data language, all of these quickstart workflows are excluded from the language statistics by default. That project also has a few GraphQL files with GraphQL queries. GraphQL is likewise excluded by default as a data language. In this repository, I have configured Linguist to include both of these with the following in that repository's .gitattributes file:



*.graphql linguist-detectable
quickstart/*.yml linguist-detectable

I used quickstart/*.yml linguist-detectable instead of *.yml linguist-detectable because the latter would include yml files from the .github/workflows directory, which are CI/CD workflows for this repository; whereas those that I put in the quickstart directory are there as examples of how to use the action.

In general, to include a data language (or a prose language) that would otherwise be excluded, add a line to the .gitattributes file with a pattern describing the files you want to include, followed by linguist-detectable.
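The same pattern applies to prose languages. Here is a minimal sketch, assuming a hypothetical repository where the Markdown files and one directory of JSON files are the actual content of the project:

```gitattributes
# Hypothetical: count Markdown (a prose language) in the language statistics.
*.md linguist-detectable

# Hypothetical: count only the JSON files directly inside data/, not every JSON file.
data/*.json linguist-detectable
```

Note that linguist-detectable only addresses the exclusion by language type. I would expect files that Linguist treats as documentation, vendored, or generated to remain excluded unless those defaults are overridden as well.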

Excluding a Language or Directory

Perhaps there is a language, or maybe just a directory, that you'd like to exclude. There are multiple ways to accomplish this, and which one you should use likely depends on your reason for excluding it. As noted earlier, Linguist excludes documentation by default, provided that it can detect it as documentation, such as when it lives in a common documentation path like docs.

For example, one of my repositories, InteractiveBinPacking, is an educational tool implemented in Java, with a few HTML files for the contents of dialog boxes, etc. It also has a directory of example assignments with LaTeX source to enable course instructors to easily customize assignments. HTML and LaTeX are both classified as markup languages, and Java is obviously a programming language, so all three are included by default, and a language chart with Java, HTML, and TeX makes sense. So far, no configuration is necessary. I published a short journal article about the tool in the Journal of Open Source Education. That journal conducts the peer review within the repository itself, with a paper directory holding a Markdown file with the content of the paper, and usually a BibTeX file with the citation data for the paper's references. Markdown is automatically excluded as prose, which is fine here. However, the BibTeX file would by default be included in the TeX count. The directory of example assignments in LaTeX is part of the purpose of the repository, but this BibTeX file is in a sense part of the documentation of the tool.

I could exclude it with:



*.bib -linguist-detectable

Notice the - in the above. Just as linguist-detectable can be used to direct Linguist to include a language it normally excludes, -linguist-detectable can be used to direct it to exclude a language it normally includes. Instead, I went with a more semantic approach, and excluded the paper directory by specifying that it is documentation with the following (you can also see the .gitattributes of that project directly):



paper/* linguist-documentation

Either of these works. If you want to exclude a language that is included by default because it is part of the documentation, then the latter approach better expresses your intent.
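One detail worth keeping in mind when excluding a directory: patterns in .gitattributes follow the same matching rules as .gitignore, where a single * does not match a slash. The sketch below contrasts the two forms. The file names are hypothetical, and the second form is only needed if the directory has subdirectories.

```gitattributes
# Matches only files directly inside paper/, e.g. a hypothetical paper/paper.bib.
paper/* linguist-documentation

# Matches everything inside paper/ at any depth, e.g. a hypothetical
# paper/figures/plot.tex as well as paper/paper.bib.
paper/** linguist-documentation
```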

Find Out More

The language charts on the SVGs generated by the user-statistician GitHub Action rely on the language data extracted by Linguist, as reported by GitHub's GraphQL API. For more information on that feature of the user-statistician, or if you are interested in using that action, see its GitHub repository:

cicirello / user-statistician

Generate a GitHub stats SVG for your GitHub Profile README in GitHub Actions

The cicirello/user-statistician GitHub Action generates a detailed visual summary of your activity on GitHub in the form of an SVG suitable to display on your GitHub Profile README. Although the intended use-case is to generate an SVG image for your GitHub Profile README, you can also potentially link to the image from a personal website, or from anywhere else where you'd like to share a summary of your activity on GitHub. The SVG that the action generates includes statistics for the repositories that you own, your contribution statistics (e.g., commits, issues, PRs, etc), as well as the distribution of languages within public repositories that you own. The user stats image can be customized, including the colors such as with one of the built-in themes or your own set of custom…

For additional examples of how you can configure Linguist, see Linguist's documentation.

Where You Can Find Me

On the Web:

Vincent A. Cicirello - Professor of Computer Science

Vincent A. Cicirello - Professor of Computer Science at Stockton University - is a researcher in artificial intelligence, evolutionary computation, swarm intelligence, and computational intelligence, with a Ph.D. in Robotics from Carnegie Mellon University. He is an ACM Senior Member, IEEE Senior Member, AAAI Life Member, EAI Distinguished Member, and SIAM Member.

cicirello.org