Sungu, Hasan Emrah

Posted on Dec 21, 2021

How to compare two files using .NET, really really fast!

#csharp #dotnet #algorithms

Optimization Problem

Last week, I finished my tasks at work early so I had some time at hand, which I used to optimize some tools at work!

Without any delay let us move onto what I did. SO, there is this in-company tool which is used to serialize files from "SOURCE" folder, to "TARGET" folder. If source file and target file are different, it also copies the source file to target directory. So far so good! However, when I looked at the source code, I noticed that the comparison is done by checking MD5 hashes of both files.

This is an optimization chance for the following reasons:
a) In order to calculate MD5hash, all the bytes of the file has to be read and then put through message-digest algorithm.
b) A matching MD5 hash does not 100% mean, files are the same

*Since MD5 is a 128bit hashing algorithm, Birthday problem predicts on average hashing 2⁶⁴ files will end up with a collision ← well, not much of a possibility anyways

Benchmark

※I used BenchmarkDotNet which is also supported by the .NET Foundation, to benchmark the code I wrote. I am comparing 128mB of same files!

Since I want to have a standardized interface, I started out by designing my FileComparer class. I think I commented good enough so it does not need any further explaining :)

MD5 Comparison

Now to implement our baseline, which is MD5 comparison.

After running BenchmarkDotNet, here we can see the result.

Method	Mean	Error	StdDev	Ratio
Md5Comparer	1.302 s	0.0090 s	0.0080 s	1.00

Compare Individual Bytes

Can we do better? Of course! We do not need to calculate the MD5 hash at all. We can do a byte-to-byte check and do an early return as soon as we find a not matching byte!

Wow! ReadWholeFileAtOnce is almost 7 times faster, but it is not memory efficient. It reads whole file into memory even before doing any comparison! But Can we do better? Of course!

Method	Mean	Error	StdDev	Ratio
Md5Comparer	1,296.4 ms	11.02 ms	10.31 ms	1.00
ReadWholeFileAtOnce	190.2 ms	2.66 ms	2.49 ms	0.15

LINQ is not that fast

But before doing better I would like to show "efficiency" of LINQ queries.

Method	Mean	Error	StdDev	Ratio
Md5Comparer	1,299.5 ms	6.09 ms	5.40 ms	1.00
CompareUsingLinq	830.0 ms	1.86 ms	1.45 ms	0.64
UseSequenceEquals	805.5 ms	7.62 ms	7.13 ms	0.62
ReadWholeFileAtOnce	190.4 ms	2.90 ms	2.71 ms	0.15

As can be seen from the results, simple for loop is 4 times faster than a LINQ query. Of course, it does not mean that this is always the case but for high performance code, it is wise to stay away from LINQ and stick to simple choices.

Read File in Chunks

Continuing on! ...But Can we do better? Of course!
Now is the time to up the optimization game! We will read the files in chunks of certain size and then compare the bytes. Here is the new base class which inherits from FileComparer.cs.

Next thing is to choose a suitable ChunkSize, which should yield a faster speed than reading the whole file at once and also must be better with memory efficiency as well. Here is a benchmark I run on the suitable BufferSize. I chose my candidates as
4096 * 1, 4096 * 2, 4096 * 4, 4096 * 8, 4096 * 16, 4096 * 32, 4096 * 64, 4096 * 128 bytes and turns out 4096 * 32 is a good choice.

So currently, reading files in chunks of 4096 * 32 and comparing them one byte at a time seems to be the fastest method, almost 8 times faster than MD5 comparison

Method	BufferSize	Mean	Ratio
Md5Comparer		1,298.1 ms	1.00
AllFile		190.6 ms	0.15
Chunks_One	131072	168.7 ms	0.13

Reading in Chunks comparing 8 bytes at a time

But Can we do better? Of course! There is a pretty neat trick which you can find in some implementations in glibc, about strlen or memcmp as well. It makes use of comparing word length of bytes at a time! Here is the source code!

So, it looks like comparing 8 bytes at a time is a reasonable trick and we end up with 10 times the speed of MD5 comparison!

Method	BufferSize	Mean	Ratio
Md5Comparer		1,298.1 ms	1.00
AllFile		190.6 ms	0.15
Chunks_One	131072	168.7 ms	0.13
Chunks_Eight	131072	126.9 ms	0.10

Using SIMD

But Can we do better? Of course! By vectorizing our problem at hand, we can use SIMD instructions as long as our hardware supports it. Here is a class using general purpose Vecotr class.

After Vectorizing, we are more than 15 times faster the MD5 comparison speed!

Method	BufferSize	Mean	Error	StdDev	Ratio
Md5	131072	1,293.05 ms	7.763 ms	6.882 ms	1.00
Vector	131072	81.92 ms	0.240 ms	0.188 ms	0.06

But can we do better? Of course!
I am currently using a CPU which supports AVX2 SIMD instructions. Instead general purpose Vector class, let us use hardware specific instrinsics.

※I guess almost any CPU released after the year 2013 should support it, but I am not sure.

After AVX2, we are more than 16 times faster the MD5 comparison speed!

Method	BufferSize	Mean	Error	StdDev	Ratio
Md5	131072	1,293.05 ms	7.763 ms	6.882 ms	1.00
AVX2	131072	81.92 ms	0.240 ms	0.188 ms	0.06

Top comments (5)

Emmanuel Counasse • Aug 7 '22

Be carefull the algorithm ReadFileInChunksAndCompareAvx2 is not correct you must move into oh1 and oh2 otherwise you are always comparing the same segment.

I would fix the last loop with :

var totalProcessed = 0;
while (totalProcessed < count1) {
    var result = Avx2.CompareEqual(Avx.LoadVector256(oh1+ totalProcessed), Avx.LoadVector256(oh2+totalProcessed));
    if (Avx2.MoveMask(result) != -1) {
        return false;
    }
    totalProcessed += Vector256<byte>.Count;
}

Markus Koch • Dec 21 '21

Thanks for the helpful post. However, there is one small error. The table for AVX2 contains the same measured values as the Vector table.

Sungu, Hasan Emrah • Dec 21 '21

Thanks for the heads up! I also noticed a couple of more copy-paste mistakes. I will fix as soon as possible

Andreas Jakof • Dec 23 '21

Nice in depth write up! Thank you.
The situation is rare und my case, but I‘ll take away, that Vector comparison is really fast and reading chunks of 32*4096 bytes further optimizes this.

Andreas • Mar 27 '25

Great post. Thank you very much!
But why are you using a loop in ReadIntoBuffer?
Shouldn't it be enough to just read once? Isn't this, what "Stream.Read" is designed to do?