Bohdan Stupak

Posted on Apr 28, 2021 • Originally published at wkalmar.github.io

Batch processing with Directory.EnumerateFiles

#csharp #dotnet #performance

If you need to retrieve files from a directory, Directory.GetFiles is a simple answer that is sufficient for most scenarios. However, when you deal with a large amount of data, you might need more advanced techniques.

Example

Let's assume you have a big data solution and you need to process a directory that contains 200,000 files. For each file, you extract some basic info:

public record FileProcessingDto
{
    public string FullPath { get; set; }
    public long Size { get; set; }
    public string FileNameWithoutExtension { get; set; }
    public string Hash { get; internal set; }
}

Note how we conveniently use novel C# 9 record types for our DTO here.

After that, we send the extracted info for further processing. Let's emulate it with the following snippet:

public class FileProcessingService
{
    public Task Process(IReadOnlyCollection<FileProcessingDto> files, CancellationToken cancellationToken = default)
    {
        // Iterate eagerly: a bare Select is lazy and would never run its lambda.
        foreach (var p in files)
        {
            Console.WriteLine($"Processing {p.FileNameWithoutExtension} located at {p.FullPath} of size {p.Size} bytes");
        }

        return Task.Delay(TimeSpan.FromMilliseconds(20), cancellationToken);
    }
}

Now the final piece is extracting the info and calling the service:

public class Worker
{
    public const string Path = @"path to 200k files";
    private readonly FileProcessingService _processingService;

    public Worker()
    {
        _processingService = new FileProcessingService();
    }

    private string CalculateHash(string file)
    {
        using (var md5Instance = MD5.Create())
        {
            using (var stream = File.OpenRead(file))
            {
                var hashResult = md5Instance.ComputeHash(stream);
                return BitConverter.ToString(hashResult)
                    .Replace("-", "", StringComparison.OrdinalIgnoreCase)
                    .ToLowerInvariant();
            }
        }
    }

    private FileProcessingDto MapToDto(string file)
    {
        var fileInfo = new FileInfo(file);
        return new FileProcessingDto()
        {
            FullPath = file,
            Size = fileInfo.Length,
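            // fileInfo.Name would keep the extension. The namespace is spelled out
            // because the Path constant above hides System.IO.Path inside this class.
            FileNameWithoutExtension = System.IO.Path.GetFileNameWithoutExtension(file),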
            Hash = CalculateHash(file)
        };
    }

    public Task DoWork()
    {
        var files = Directory.GetFiles(Path)
            .Select(p => MapToDto(p))
            .ToList();

        return _processingService.Process(files);
    }
}

Note that here we act in a naive fashion and extract all files via Directory.GetFiles(Path) in one take.

However, once you run this code via

await new Worker().DoWork()

you'll notice that results are far from satisfying and the application is consuming memory extensively.

Directory.EnumerateFiles to the rescue

The thing with Directory.EnumerateFiles is that it returns IEnumerable<string>, which allows us to fetch collection items one by one instead of waiting for the whole array that Directory.GetFiles builds. This in turn saves us from the excessive memory use that comes with loading huge amounts of data at once.
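The difference between the two calls can be seen in a minimal sketch. The temporary directory and the EnumerationDemo class are made up for illustration and are not part of the worker above.

```csharp
using System;
using System.IO;
using System.Linq;

public static class EnumerationDemo
{
    // Creates a throwaway directory with a few empty files (illustrative test data).
    public static string CreateSampleDirectory(int fileCount)
    {
        var dir = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
        Directory.CreateDirectory(dir);
        for (var i = 0; i < fileCount; i++)
        {
            File.WriteAllText(Path.Combine(dir, $"file{i}.txt"), string.Empty);
        }
        return dir;
    }

    public static void Main()
    {
        var dir = CreateSampleDirectory(5);
        try
        {
            // GetFiles builds the complete string[] before it returns.
            string[] all = Directory.GetFiles(dir);
            Console.WriteLine($"GetFiles returned {all.Length} paths at once");

            // EnumerateFiles yields paths lazily, so Take(2) stops after two entries.
            foreach (var file in Directory.EnumerateFiles(dir).Take(2))
            {
                Console.WriteLine($"Enumerated {Path.GetFileName(file)}");
            }
        }
        finally
        {
            Directory.Delete(dir, recursive: true);
        }
    }
}
```

With five files the difference is invisible, but with 200,000 files the first call has to hold every path in memory before the first one can be processed.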

Still, as you may have noticed, FileProcessingService.Process has a delay coded into it (we emulate some sort of I/O operation with a simple delay). In a real-world scenario, this might be a call to an external HTTP endpoint or work with storage. This brings us to the conclusion that calling FileProcessingService.Process 200,000 times, once per file, would be inefficient. That's why we're going to load reasonably sized batches of data into memory at once.
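To put rough numbers on this, here is a back-of-the-envelope sketch. It assumes that every call costs exactly the 20 ms delay from the snippet above and ignores everything else. The CallCountEstimate class is made up for illustration.

```csharp
using System;

public static class CallCountEstimate
{
    // Number of service calls needed for itemCount items in batches of batchSize.
    public static int NumberOfCalls(int itemCount, int batchSize) =>
        (itemCount + batchSize - 1) / batchSize; // ceiling division

    // Total time spent waiting if every call costs perCallDelay.
    public static TimeSpan TotalDelay(int itemCount, int batchSize, TimeSpan perCallDelay) =>
        TimeSpan.FromTicks(perCallDelay.Ticks * NumberOfCalls(itemCount, batchSize));

    public static void Main()
    {
        var perCall = TimeSpan.FromMilliseconds(20);

        // One call per file: 200,000 calls.
        Console.WriteLine(TotalDelay(200_000, 1, perCall).TotalSeconds);      // 4000

        // Batches of 10,000 files: 20 calls.
        Console.WriteLine(TotalDelay(200_000, 10_000, perCall).TotalSeconds); // 0.4
    }
}
```

Under that assumption, item-by-item processing spends more than an hour just waiting, while batches of 10,000 bring the waiting down to less than a second.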

The reworked code looks as follows:

public class WorkerImproved
{
    //omitted for brevity

    public async Task DoWork()
    {
        const int batchSize = 10000;
        var files = Directory.EnumerateFiles(Path);
        var count = 0;
        var filesToProcess = new List<FileProcessingDto>(batchSize);

        foreach (var file in files)
        {
            count++;
            filesToProcess.Add(MapToDto(file));
            if (count == batchSize)
            {
                await _processingService.Process(filesToProcess);
                count = 0;
                filesToProcess.Clear();
            }

        }
        if (filesToProcess.Any())
        {
            await _processingService.Process(filesToProcess);
        }
    }
}

Here we enumerate the collection with foreach, and once we reach the batch size we process the batch and clear the list. The only subtle point is that we have to call the service one last time after the loop exits in order to flush the remaining items.
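The same accumulate-and-flush logic can be pulled out into a reusable helper. The Batch extension method below is a hypothetical sketch, not something the worker above relies on. On .NET 6 and later, the built-in Enumerable.Chunk offers similar behavior.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BatchingExtensions
{
    // Hypothetical helper: lazily splits a sequence into lists of at most batchSize items.
    public static IEnumerable<IReadOnlyList<T>> Batch<T>(this IEnumerable<T> source, int batchSize)
    {
        if (source is null) throw new ArgumentNullException(nameof(source));
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));
        return BatchIterator(source, batchSize);
    }

    private static IEnumerable<IReadOnlyList<T>> BatchIterator<T>(IEnumerable<T> source, int batchSize)
    {
        var batch = new List<T>(batchSize);
        foreach (var item in source)
        {
            batch.Add(item);
            if (batch.Count == batchSize)
            {
                yield return batch;
                // Start a fresh list so the caller may keep the previous batch.
                batch = new List<T>(batchSize);
            }
        }

        // Flush the remaining items, mirroring the final call after the loop.
        if (batch.Count > 0)
        {
            yield return batch;
        }
    }
}

public static class BatchDemo
{
    public static void Main()
    {
        foreach (var batch in Enumerable.Range(1, 7).Batch(3))
        {
            Console.WriteLine(string.Join(",", batch));
        }
        // Prints:
        // 1,2,3
        // 4,5,6
        // 7
    }
}
```

With such a helper, DoWork could shrink to a loop over Directory.EnumerateFiles(Path).Select(MapToDto).Batch(batchSize) that awaits Process for each batch. Only one batch of DTOs is alive at a time, because the source is still enumerated lazily.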

Evaluation

Results produced by BenchmarkDotNet are pretty convincing

A few words on batch processing

In this article, we took a glance at a common pattern in software engineering. Reasonably sized batches help us avoid both the I/O penalty of working item by item and the excessive memory consumption of loading all items into memory at once.

As a rule, you should strive to use batch APIs when doing I/O operations for multiple items. And once the number of items grows large, you should think about splitting them into batches.

A few words on return types

Quite often when dealing with codebases I see code similar to the following:

public IEnumerable<int> Numbers => new List<int> { 1, 2, 3 };

I would argue that this code violates Postel's principle: as a consumer of the property, I can't figure out whether the items are produced one by one as I enumerate them or are already loaded into memory.

This is why I suggest being more specific about the return type, i.e.

public IList<int> Numbers => new List<int> { 1, 2, 3 };
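To illustrate why the declared type matters to a consumer, here is a small sketch with a made-up ReturnTypeDemo class. Both members hand out the same three numbers, but only the signature hints at whether they are recomputed on every enumeration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ReturnTypeDemo
{
    // Counts how many items the lazy sequence has produced so far.
    public static int Evaluations;

    // Lazy: the body runs again on every enumeration.
    public static IEnumerable<int> LazyNumbers()
    {
        for (var i = 1; i <= 3; i++)
        {
            Evaluations++;
            yield return i;
        }
    }

    // Materialized: the IList<int> return type tells the caller the items are already in memory.
    public static IList<int> LoadedNumbers() => new List<int> { 1, 2, 3 };

    public static void Main()
    {
        var lazy = LazyNumbers();
        Console.WriteLine(lazy.Sum());   // 6
        Console.WriteLine(lazy.Sum());   // 6
        Console.WriteLine(Evaluations);  // 6, because the three items were generated twice

        var loaded = LoadedNumbers();
        Console.WriteLine(loaded.Count); // 3, available without enumerating
    }
}
```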

Conclusion

Batching is a nice technique that allows you to handle large amounts of data gracefully. Directory.EnumerateFiles is the API that lets you organize batch processing for a directory with a large number of files.
