DEV Community

Cover image for Building a File Analysis Dataset with Python
Matt Eland
Matt Eland

Posted on • Originally published at

Building a File Analysis Dataset with Python

Last year I devised some ways of analyzing the history and structure of code in a visual way, but I didn't document much of that to share with the community. This article explores the process I used to build a CSV file containing the path, extension, project, and line of code count in all source files in a project.

Note: if the idea of getting data out of a git repository interests you, check out my article on Using PyDriller to Extract git Information

This code is effective if you want to quickly analyze code files you have on your hard drive and build a dataset for further analysis and data visualization.

Much of the Python code I'm sharing is heavily borrowed from various StackOverflow answers with my own tweaks and changes. I have tried to credit all sources I used, but this code was built a year ago and I may have forgotten a source or two that was helpful.


I wrote this code using Python 3.8 running on a Jupyter Notebook running Python 3.8. However, Jupyter Notebooks are not required for this code to function.

The code relies on the os library, which is standard to Python, as well as the pandas library for analyzing tabular data.

import pandas as pd
import os
Enter fullscreen mode Exit fullscreen mode

The Code

What follows is a sequential breakdown of the functions needed to accomplish the task of extracting data from the code.

Identifying Files

Analyzing files cannot be done without determining what files to analyze.

I accomplish this using the os library's listdir method along with several other path and directory methods to recursively traverse the directory tree starting in a given directory.

I also intentionally ignore directories known to contain a large number of ignorable files such as reporting directories, the .git directory, and directories used by the development environment.

# Some directories should be ignored as their output is not helpful
ignored_directories = ['.git', '.vs', 'obj', 'ndependout', 'bin', 'debug']

def get_file_list(dir_path, paths=None):
    Gets a list of files in this this directory and all contained directories

    files = list()
    contents = os.listdir(dir_path)

    for entry in contents:
        path = os.path.join(dir_path, entry)
        if os.path.isdir(path):
            # Ignore build and reporting directories
            if entry.lower() in ignored_directories:

            # Maintain an accurate, but separate, hierarchy array
            if paths is None:
                p = [entry]
                p = paths[:]

            files = files + get_file_list(path, p)
            files.append((path, paths))

    return files
Enter fullscreen mode Exit fullscreen mode

Determining if a file contains source code

In order to determine if a file is a source code file and should be included in the results, I chose a naive approach where I only pay attention to files with certain extensions.

This code could benefit from a more comprehensive list or even perhaps some advanced analysis of the contents of the files. However, simply looking at the extension was fine for my purposes and should serve most readers well.

# Common source file extensions. This list is very incomplete. Intentionally not including JSON / XML
source_extensions = [

def is_source_file(file_label):
    Defines what a source file is.
    file, _ = file_label
    _, ext = os.path.splitext(file)

    return ext.lower() in source_extensions
Enter fullscreen mode Exit fullscreen mode

Counting Lines of Code

After that, I needed to be able to quickly determine the length of a file.

I used a Stack Overflow answer to determine how to efficiently read the number of lines in a file using the following code:

def count_lines(path):
    Reads the file at the specified path and returns the number of lines in that file

    def _make_gen(reader):
        b = reader(2 ** 16)
        while b:
            yield b
            b = reader(2 ** 16)

    with open(path, "rb") as f:
        count = sum(buf.count(b"\n") for buf in _make_gen(

    return count
Enter fullscreen mode Exit fullscreen mode

This code uses a buffered reader to quickly read through the file buffer and return the total count.

Geeks for Geeks offers a slower but easier to understand implementation for those interested:

def count_lines_simple(path):
    with open(path, 'r') as fp:
        return sum(1 for line in fp)
Enter fullscreen mode Exit fullscreen mode

Getting File Metrics

After this, I built a function that could take multiple files and generate a list of detail objects for all the files in the list.

Each file details object contained:

  • The root directory the analysis started in
  • The full path of the file
  • The project the file is in (the base directory in the project the file lives inside of)
  • The path of the file relative to the
  • The name of the file
  • The file's extension
  • The number of lines in that file

This is accomplished by looping over all files and folders in the incoming files parameter, then counting the lines of the files using the count_lines function, enumerating any folders, and building path information on each file.

Once all information is known, a file details object is created and appended to the resulting list that is returned once all files have been analyzed.

def get_file_metrics(files, root):
    This function gets all metrics for the files and returns them as a list of file detail objects
    results = []

    for file, folders in files:
        lines = count_lines(file) # Slow as it actually reads the file
        _, filename = os.path.split(file)
        _, ext = os.path.splitext(filename)

        fullpath = ''

        if folders != None and len(folders) > 0:
            project = folders[0]
            for folder in folders[1:]:
                if len(fullpath) > 0:
                    fullpath += '/'
                fullpath += folder
            project = ''

        if len(fullpath) <= 0:
            fullpath = '.'

        id = root + '/' + project + '/' + fullpath + '/' + filename

        file_details = {
                        'fullpath': id,
                        'root': root,
                        'project': project,
                        'path': fullpath,
                        'filename': filename,
                        'ext': ext,
                        'lines': lines,

    return results
Enter fullscreen mode Exit fullscreen mode

Putting it all together

Finally, I built a central function to kick off source analysis in a given path:

def get_source_file_metrics(path):
    This function gets all source files and metrics associated with them from a given path
    source_files = filter(is_source_file, get_file_list(path))
    return get_file_metrics(list(source_files), path)    
Enter fullscreen mode Exit fullscreen mode

This function calls to the other functions to build the resulting list of file details.

I could call this method by declaring a list of directories I'm interested in, and then calling it on those directories.

# Paths should be one or more directories of interest
paths = ['C:/Dev/MachineLearning/src', 'C:/Dev/MachineLearning/test']

# Pull Source file metrics
files = []
for path in paths:
Enter fullscreen mode Exit fullscreen mode

Here I'm analyzing all projects in the src and test directories of the ML.NET repository. I chose to include these as separate paths because they represent two different groupings of projects in this repository.

Saving the Results

Once the files list has been populated, it can very easily be used to create a Pandas DataFrame for tabular analysis. The DataFrame also offers an easy method to serialize the data to a CSV file as shown below:

# Grab the file metrics for source files and put them into a data frame
df = pd.DataFrame(files)
df = df.sort_values('lines', ascending=False)

# Write to a file for other analyses processes
Enter fullscreen mode Exit fullscreen mode

Finally, we can preview the first 5 rows of that dataset using the df.head() method.

In my sample dataset, this produces the following results:

File Data Results in a DataFrame

Next Steps

Now that you have hierarchical data saved to a CSV file, and I've illustrated a separate way to get information out of a git repository, the next steps will involve merging this data together for analysis and visualization.

Stay tuned for future updates in this series on Visualizing Code.

Top comments (0)