Tai Kedzierski
A slightly less naive binary diff

binary-diff on Gitlab

This is a small, naive project I started because I remember one day wondering, "how would I visually diff binary files for small subtle changes?" I cannot remember what it was I was trying to inspect at the time, but I know it's not been the only time I've wanted to do this. So here we are.

(Note that this project was mainly for fun and exercising some python programming)

When asking the Internets about diffing binaries, there are some lackluster answers on StackOverflow that suggest "just using hexdump and then diff" (like this).

The problem is that with hexdump, you can be guaranteed that, unless the byte count remained the same between the files, the entire remainder of the stream after the first change (swathes of null bytes notwithstanding) will differ from its counterpart in the other file, owing to the offset codes and the displacement of the bytes.

For example, take two files each containing a list of animals, where only one line has changed. Running diff on their hexdump outputs shows:

< 00000000  41 20 66 65 72 72 65 74  0a 41 20 64 6f 67 0a 41  |A ferret.A dog.A|
< 00000010  20 67 6f 61 74 0a 41 20  73 77 61 6e 0a 41 20 63  | goat.A swan.A c|
< 00000020  61 74 0a                                          |at.|
< 00000023
--------
> 00000000  41 20 66 65 72 72 65 74  0a 41 20 64 6f 67 0a 47  |A ferret.A dog.G|
> 00000010  6f 61 74 73 0a 41 20 73  77 61 6e 0a 41 20 63 61  |oats.A swan.A ca|
> 00000020  74 0a                                             |t.|
> 00000022

Because of the displacement of the bytes, diff decides to tell me that the files are fundamentally different at every point.

The tool I knocked together (in one afternoon, which is to say how naive it still is) gets around that by artificially re-introducing the concept of "chunks".

Where diff is most useful is in showing which parts of source code have changed, and then recognising the rest as "unchanged", but to do so it has to work on its smallest data sequence - the line. There are no "lines" in binary files, so instead we need to pre-process the binary data and split it into chunks. In this way, even if the byte count at the point of change is different, diff will only show which chunk was affected. Once the chunk ends, a new chunk begins in each file, and since these are now identical, diff doesn't continue marking the two as different.

I used Python to implement a custom file reader (which could certainly be optimised for disk-access efficiency, I know). The script uses a list of specific bytes it expects to split chunks on (currently CR, LF and NUL). When it encounters a consecutive run of such bytes (for example, the fields of NUL bytes that occur in compiled programs), it keeps them all together as a single "terminator". This keeps the chunking sane.
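The chunking idea can be sketched roughly like this (a simplified re-implementation for illustration, not the actual chunk_hexdump.py code; the function name is made up):

```python
# Illustrative sketch of the chunking idea: split a byte stream on
# terminator bytes (CR, LF, NUL), keeping a consecutive run of
# terminators attached to the chunk it ends.
TERMINATORS = {0x0D, 0x0A, 0x00}

def chunks(data: bytes):
    chunk = bytearray()
    in_terminator = False
    for byte in data:
        if byte in TERMINATORS:
            chunk.append(byte)
            in_terminator = True
        else:
            if in_terminator:
                # The terminator run has ended: emit the finished chunk
                yield bytes(chunk)
                chunk = bytearray()
                in_terminator = False
            chunk.append(byte)
    if chunk:
        yield bytes(chunk)

# Each chunk can then be hex-formatted on its own line, giving diff
# a "line" to work with:
for c in chunks(b"A ferret\nA dog\nA cat\n"):
    print(c.hex(" "))
```

Because a run of terminators closes a chunk, a field of NULs in a compiled binary becomes one chunk boundary rather than thousands of empty "lines".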

In case it wasn't quite clear:

hexdump + diff VS chunking + diff

Example

Let's say we have these two hello-world programs:

#include <stdio.h>
int main() {
  printf("Hello world");
}
#include <stdio.h>
int main() {
  printf("Adieu monde cruel!");
}

And we compile them each to hello.exe and adieu.exe.

The naive approach would be to do this:

diff <(hd hello.exe) <(hd adieu.exe) -u --color

After the first change in bytes, a vast swathe of the stream is highlighted as changed, even though only a few bytes actually changed (the highlighting is interrupted only by a vast field of null bytes).

Conversely, you can run

bdiff.sh hello.exe adieu.exe -u --color

As well as no longer needing process substitution, we see in the output that only the relevant chunks have changed. Whereas in the naive version we were none the wiser as to which byte sequences differed (because "everything" differed), here we can see the discrete, precise locations where the data differs.

I doubt this tool will ever be of actual production usefulness, but who knows. Other people were wondering too, so there must be some use case.

Discussion (6)

Thomas Kluyver

I just had a use for this - I wanted to see exactly how a gzipped file differed from plain zlib compression. Gzip adds some data to the start and the end of the compressed stream, so hexdumping and then diffing would just show the file completely changed, as you saw.

binary-diff worked nicely, giving me a readable diff of about 30 lines total in about 10 seconds (on files about 2 MB each). :-)

Does diff operate in a streaming fashion, or does it need to load the two inputs fully into memory? If the files have to be fully loaded anyway, I wonder if it could be faster by chunking both in Python and using difflib.SequenceMatcher to compare the lists of chunks, rather than formatting & streaming all the data from both files. That would mean giving up the nice output options of diff, though.

Tai Kedzierski (author)

Well I'll be. It was useful after all 😅

I just ran diff -a $ISO_1 $ISO_2 on two ISO files, and can confirm that diff indeed pre-processes the entire files before producing output (I waited ~13s for output).

Using difflib might gain advantages in speed, but as you note, we like the nice options given by native diff 😁

One option could actually be to set a byte cap on bexplode.py so that it stops reading after it has processed N bytes worth of data in each file - at least, that would mean that file header inspections would not be so tedious.

Thomas Kluyver

I wonder if the Python program could use difflib to identify long similar sequences and abbreviate them with ellipsis in the middle, make two temp files with the differing chunks plus the abbreviated matches for context, and feed them into diff. It would mess up the 'line numbers' by condensing a bunch of matching lines into one, but I guess line numbers aren't really something people would look at often in this context.
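That abbreviation idea could be sketched like so (hypothetical; it condenses long runs of matching hex lines to an ellipsis line, keeping a little context either side, before the two lists are written out for diff):

```python
import difflib

def condense(lines_a, lines_b, context=2):
    """Replace long matching runs with '...', keeping `context` lines
    on each side. Returns the two condensed line lists; because the
    '...' lines are identical in both, diff will treat them as matches."""
    out_a, out_b = [], []
    matcher = difflib.SequenceMatcher(a=lines_a, b=lines_b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal" and (i2 - i1) > 2 * context + 1:
            kept = lines_a[i1:i1 + context] + ["..."] + lines_a[i2 - context:i2]
            out_a += kept
            out_b += kept
        else:
            out_a += lines_a[i1:i2]
            out_b += lines_b[j1:j2]
    return out_a, out_b
```

As the comment says, the condensed lists could then go into two temp files and be fed to diff, at the cost of meaningless line numbers.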

Tai Kedzierski (author)

I think I see where this is going - you want the python program to eliminate the duplicates before getting diff to do the display.

Alas, that's not what the tool does - chunk_hexdump.py is a preprocessor for a single data file, turning it into hex-string output. At the Python level, there is no awareness of a second file against which chunks could be compared.

Looking at the code, I could actually change it a little so that, instead of loading the entire file into memory (which the chunker currently does) before printing the chunks, it prints them as they are read and processed, without holding it all in RAM.

This would not be a necessary speed gain, but with big files it would avoid paging in the two Python processes...

Correction. The chunker already just prints chunks as they are found. The bottleneck is likely indeed diff...

Thomas Kluyver

Yup, I was wondering about rearranging it so that it's driven by a single Python process which calls diff in a subprocess, rather than a bash script which calls Python twice.
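The single-driver arrangement could look like this sketch (hypothetical; the hexdump_chunks helper here is a crude stand-in for the real chunker, and it uses temp files for simplicity, the trade-off discussed below):

```python
import subprocess
import tempfile

def hexdump_chunks(path: str) -> bytes:
    # Stand-in for the real chunker: hex-format each newline-delimited
    # chunk on its own line.
    with open(path, "rb") as f:
        data = f.read()
    return b"\n".join(chunk.hex(" ").encode() for chunk in data.split(b"\n")) + b"\n"

def bdiff(file_a: str, file_b: str) -> str:
    # One Python process pre-chunks both files, then drives diff itself.
    with tempfile.NamedTemporaryFile() as ta, tempfile.NamedTemporaryFile() as tb:
        ta.write(hexdump_chunks(file_a)); ta.flush()
        tb.write(hexdump_chunks(file_b)); tb.flush()
        # diff exits with status 1 when the files differ, so no check=True
        result = subprocess.run(["diff", "-u", ta.name, tb.name],
                                capture_output=True, text=True)
        return result.stdout
```

This keeps diff's nice output while putting Python in charge of orchestration.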

In my tests, calling bexplode.py on either of the individual files takes about 5 seconds, so I think the 10 seconds to get the diff is dominated by doing this twice. (I'm piping the output to /dev/null for timing, so it's not limited by the terminal)

But maybe it's possible to just make bexplode faster. E.g. I spotted that from Python 3.8, bytes.hex() takes a separator parameter, so sep=' ' can replace a re.sub() call. That's about a 30% speedup in my tests.
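The change Thomas describes is essentially this (illustrative; the actual regex in bexplode.py may differ):

```python
import re

data = b"A ferret\n"

# Old approach: hex() the whole byte string, then insert a space
# after every pair of hex digits with a regex.
old = re.sub(r"(..)", r"\1 ", data.hex()).strip()

# Since Python 3.8, bytes.hex() accepts a separator directly.
new = data.hex(" ")

assert old == new  # same output, no regex pass needed
```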

Tai Kedzierski (author)

So the current bash script does indeed make two calls to the Python script; each one's output is written to a named pipe (typically held in RAM) rather than to disk.

I suspect the overhead of calling the two Python processes from a bash script would be significantly outweighed by writing to two on-disk temp files, which diff would then need to read... I was partway through implementing the suggestion when I realised this... 🤦‍♂️

I did have a look at difflib to see if there's a way of passing the chunks to it gradually, but no, it expects the entire data pre-loaded as a list of strings. On reflection, I believe it is part of the diffing algorithm to look at "future" data further down so as to compute the context. In the worst case, the context will be searched for across the whole file...

Generally that is a sensible design, because diff is normally used with source files in the small-kilobytes range....

--

Try running

# Print the start date
date

# Run a no-op that takes the file-descriptors for FILE1 and FILE2 simultaneously
# The completion times for each will write to stderr only after completion
: <(bexplode.py FILE1 >/dev/null; echo "File 1: $(date)" >&2) <(bexplode.py FILE2 >/dev/null; echo "File 2: $(date)" >&2)

echo "Please wait - processing in background ..."

It should show that the total time is only that of the larger file.

The main culprit is likely diff, loading both fully, and then running the diffing algorithm...