Mark Edosa

How to Traverse (Walk) Multiple Directories Using Python

Introduction

This article shows you how to create a directory walker that extracts files from multiple directories and subdirectories. The walker initially uses a combination of recursion and a for loop and then uses a generator/iterator to save time and memory. I assume you are familiar with basic Python programming, functions, generators, and iterators. Let's get started.

Directory Structure

A directory is typically represented as a tree data structure where the topmost directory sits as the root node. The root node can contain zero, one or more subdirectories (child nodes), and files (leaf nodes). The subdirectories can also contain other directories and files.

The diagram below shows a Desktop (root) directory containing three subdirectories - Work, Music, and Videos.

Figure: Directory tree structure
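If the diagram is not visible, a tree of roughly this shape can be created with a short script. This is only a sketch: the folder names Work, Music, Videos, Comedy, Reggae, and Hip Pop, and the file song3.mp3, appear in this article, while the helper name build_sample_tree and every other file name are made up for illustration.

```python
import tempfile
from pathlib import Path


def build_sample_tree(base: Path) -> Path:
    """Create a small Desktop-like tree under base and return its root."""
    desktop = base / "Desktop"
    # Hypothetical layout: file names other than song3.mp3 are invented
    files = [
        "Work/report.docx",
        "Music/Hip Pop/song3.mp3",
        "Music/Reggae/song1.mp3",
        "Music/slow_jam.mp3",
        "Videos/Comedy/clip.mp4",
    ]
    for relative in files:
        target = desktop / relative
        # Create the parent folders first, then an empty file
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()
    return desktop


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = build_sample_tree(Path(tmp))
        for path in sorted(root.rglob("*")):
            print(path.relative_to(root))
```

The tree lives in a temporary directory, so it disappears when the with block ends.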

To extract the contents of all directories, you will need to visit each directory using a loop or recursion. These two methods have their advantages and disadvantages. However, this article will not discuss them. You can check out this post on StackOverflow for more information.
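To make the loop option concrete, here is a minimal sketch of a traversal that uses an explicit stack instead of recursion. The function name walk_iteratively is hypothetical and not part of pathlib; the rest of this article uses the recursive approach.

```python
import tempfile
from pathlib import Path


def walk_iteratively(root_dir: Path) -> list[Path]:
    """Return every file under root_dir using a loop and an explicit stack."""
    files: list[Path] = []
    pending = [root_dir]  # directories that still need a visit

    while pending:
        current = pending.pop()
        for entry in current.iterdir():
            if entry.is_dir():
                pending.append(entry)  # visit later instead of recursing
            else:
                files.append(entry)

    return files


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "Music" / "Hip Pop").mkdir(parents=True)
        (Path(tmp) / "Music" / "Hip Pop" / "song3.mp3").touch()
        print([f.name for f in walk_iteratively(Path(tmp))])
```

The loop version never deepens the call stack, so the depth of the tree cannot trigger Python's recursion limit.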

The Path Class from the PathLib Module

The pathlib module provides utilities for working with the filesystems of different operating systems. The Path class, an alternative to os.path, provides a high-level, friendly way to access file paths while avoiding much of the ceremony associated with the os.path module. For example:

from pathlib import Path

current_dir = Path('.')

print(current_dir) # .

Accessing and navigating subdirectories is just as easy. For example:

# The comments below assume the working directory is "Desktop",
# so every result is relative to it

# List all subdirectories
subdirs = [x for x in current_dir.iterdir() if x.is_dir()]
print(subdirs) # [WindowsPath('Music'), ...]

# List all songs within Desktop
all_songs = list(current_dir.glob('**/*.mp3'))
print(all_songs) # [WindowsPath('Music/Hip Pop/song3.mp3'), ...]

# Build paths with the / operator
work_dir = current_dir / "Work"
print(work_dir) # Work

comedy_dir = current_dir / "Videos" / "Comedy"
print(comedy_dir) # Videos\Comedy (Windows) or Videos/Comedy (Linux, macOS)

# Print a tuple containing the parts of the path
print(comedy_dir.parts) # ('Videos', 'Comedy')

If you simply want to extract a specific file type, you can use the .glob() method mentioned above. However, if you want more control over the process, then consider using a loop or a generator function.
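As a rough illustration of the difference, the sketch below first matches a single extension with a glob pattern and then uses a loop condition to match several extensions at once. The temporary folder, the file names, and the wanted_suffixes set are assumptions made up for this example.

```python
import tempfile
from pathlib import Path

wanted_suffixes = {".mp3", ".wav"}  # hypothetical set of extensions

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "Music").mkdir()
    for name in ("a.mp3", "b.wav", "notes.txt"):
        (root / "Music" / name).touch()

    # rglob('*.mp3') is shorthand for glob('**/*.mp3')
    mp3_only = sorted(p.name for p in root.rglob("*.mp3"))

    # A loop can apply any condition, e.g. several extensions at once
    audio = sorted(
        p.name for p in root.rglob("*") if p.suffix.lower() in wanted_suffixes
    )

print(mp3_only)  # ['a.mp3']
print(audio)     # ['a.mp3', 'b.wav']
```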

You can also check if the path is a directory using the .is_dir() method or if a path exists using the .exists() method. For example:

print(current_dir.is_dir()) # True

dont_know = current_dir / "Created"
print(dont_know.exists()) # False
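Beyond is_dir() and exists(), a few other Path members are handy when filtering files by name. The sketch below uses PurePosixPath so that it behaves the same on every operating system and never touches the filesystem; the path itself is only an example.

```python
from pathlib import PurePosixPath

# A pure path: only string manipulation, no filesystem access
song = PurePosixPath("Desktop/Music/Hip Pop/song3.mp3")

print(song.name)    # song3.mp3
print(song.stem)    # song3
print(song.suffix)  # .mp3
print(song.parent)  # Desktop/Music/Hip Pop

# .name is the same as the last element of .parts
print(song.parts[-1] == song.name)  # True
```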

Walking Directories

Using A Loop and Recursion

I assume you've created the example directory structure in the diagram above. Let's look at some sample code.

from pathlib import Path
from typing import Callable


def song_strategy(filename: str) -> bool:
    """Check if the file is an mp3 whose name does not contain certain words."""

    # ignore the file if any of these words are in the file name
    ignore_words = ['slow', 'jazz', 'old']

    name = filename.lower()

    # match the extension, dot included, regardless of case
    if name.endswith('.mp3'):
        return not any(kw in name for kw in ignore_words)

    return False


def collect_files(root_dir: Path, file_strategy: Callable[[str], bool]) -> list[Path]:
    """Collect files by walking multiple directories."""

    files: list[Path] = []

    # directories to ignore (compared in lowercase)
    ignore = {'work', 'videos', 'reggae'}

    def inner(current_dir: Path) -> None:
        for x in current_dir.iterdir():
            if x.is_dir():
                # recurse only into directories that are not ignored
                if x.name.lower() not in ignore:
                    inner(x)
            elif file_strategy(x.name):
                # use the strategy to decide whether to keep the file
                files.append(x)

    inner(root_dir)

    return files

# The current path is "Desktop"
print(collect_files(Path('.'), song_strategy))
  • collect_files() takes a root_dir path and a file_strategy() function that filters files.
  • song_strategy() is an example of a file_strategy() function; it selects only mp3 files whose names contain none of the ignored words. You can easily add others!
  • collect_files() uses an inner() function to recurse through each directory and collects the selected files in the files list.
  • Recursion needs a termination condition, otherwise Python eventually raises a RecursionError. Here, inner() calls itself only when x.is_dir() is true, so the recursion ends once every directory has been traversed.
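As one way of adding another strategy, the sketch below builds strategies from a list of extensions. The names make_extension_strategy and video_strategy are hypothetical and do not come from the code above; the only requirement is that the result takes a file name and returns a bool.

```python
from typing import Callable


def make_extension_strategy(*extensions: str) -> Callable[[str], bool]:
    """Build a strategy that accepts file names ending in any given extension."""
    # Normalise once so the check is case-insensitive
    allowed = tuple(ext.lower() for ext in extensions)

    def strategy(filename: str) -> bool:
        # str.endswith() accepts a tuple of candidate endings
        return filename.lower().endswith(allowed)

    return strategy


video_strategy = make_extension_strategy(".mp4", ".mkv")

print(video_strategy("clip.MP4"))   # True
print(video_strategy("song3.mp3"))  # False
```

A strategy built this way can be passed as the second argument of collect_files(), exactly like song_strategy().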

While collect_files() works, it builds the complete list in memory and returns nothing until the whole walk has finished. With thousands of directories, that means a long wait before the first result and a list that grows with every match.

Using A Generator Function

A solution to the above problem is to replace collect_files() with an alternative function that uses a generator. The function is defined below.

from pathlib import Path
from typing import Callable, Iterator


def collect_files_generator(root_dir: Path, file_strategy: Callable[[str], bool]) -> Iterator[Path]:
    """Collect files by walking multiple directories using a generator"""
    # directories to ignore (compared in lowercase)
    ignore = {'work', 'videos', 'reggae'}

    def inner(current_dir: Path) -> Iterator[Path]:
        for x in current_dir.iterdir():
            if x.is_dir():
                # yield from a sub-generator, skipping ignored directories
                if x.name.lower() not in ignore:
                    yield from inner(x)
            elif file_strategy(x.name):
                yield x

    yield from inner(root_dir)

  • collect_files_generator() is a generator that delegates to a sub-generator, inner().
  • inner() recursively delegates to itself with yield from.
  • collect_files_generator() improves on collect_files() because it produces values lazily, one at a time, instead of storing them all in the files list (in memory) before returning them. The caller can start work on the first match right away and can stop early without walking the rest of the tree.

While I assume that you already know how a generator works, this post is a great recipe on how yield from works!
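As a quick refresher, the sketch below uses yield from on a nested list, which has the same shape as nested directories. The flatten() function is an illustration only and is not used by the walker.

```python
from typing import Iterator


def flatten(items: list) -> Iterator[int]:
    """Yield the integers from an arbitrarily nested list, depth-first."""
    for item in items:
        if isinstance(item, list):
            # Delegate to the sub-generator; its values pass straight through
            yield from flatten(item)
        else:
            yield item


print(list(flatten([1, [2, [3, 4]], 5])))  # [1, 2, 3, 4, 5]
```

Without yield from, the recursive call would need its own loop of the form for value in flatten(item): yield value.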

Since a generator produces an iterator, you can control how you retrieve each element from the iterator. For example:

# loop through the iterator
for file in collect_files_generator(Path('.'), song_strategy):
    print(file)

# Or

# Automatically extract the iterator content as a list
print(list(collect_files_generator(Path('.'), song_strategy)))
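Laziness also makes it possible to take only part of the results. The sketch below uses next() and itertools.islice() on a simplified walker; iter_files() is a hypothetical stand-in for collect_files_generator() without filtering, and the temporary files are made up for the example.

```python
import tempfile
from itertools import islice
from pathlib import Path
from typing import Iterator


def iter_files(root_dir: Path) -> Iterator[Path]:
    """Hypothetical stand-in for collect_files_generator() without filtering."""
    for entry in root_dir.iterdir():
        if entry.is_dir():
            yield from iter_files(entry)
        else:
            yield entry


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for i in range(5):
        (root / f"song{i}.mp3").touch()

    walker = iter_files(root)
    first = next(walker, None)          # one result, or None if there are none
    next_two = list(islice(walker, 2))  # at most two more; the rest is never yielded

print(first is not None, len(next_two))  # True 2
```

The order in which iterdir() returns entries is not guaranteed, so the example prints counts rather than file names.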

Summary

In this article, you saw how

  • directories are traversed using loops and recursion,
  • a generator function can save memory, and shorten the wait for the first result, compared with a normal function that builds a list,
  • directories are represented as a tree and navigated with the Path class from the pathlib module,
  • a strategy function is used to filter files.

Thanks for reading.
