This article explores the topic of "binary" and "text" files. What is the difference between the two (if any)? Is there a clear definition for what constitutes a "binary" or a "text" file?
We start our journey with two candidate files whose content we would intuitively categorize as "text" and "binary" data, respectively:
```bash
echo "hello 🌍" > message
convert -size 1x1 xc:white png:white
```
We have created two files: a file named `message` with the textual content "hello 🌍" (including the Unicode symbol "Earth Globe Europe-Africa"), and a PNG image with a single white pixel called `white`. File extensions are deliberately left out.
To demonstrate that some programs distinguish between "text" and "binary" files, check out how `grep` changes its behavior:
```bash
▶ grep -R hello
message:hello 🌍

▶ grep -R PNG
Binary file white matches
```
`diff` does something similar:
```bash
▶ echo "hello world" > other-message
▶ diff other-message message
1c1
< hello world
---
> hello 🌍

▶ convert -size 1x1 xc:black png:black
▶ diff black white
Binary files black and white differ
```
How do these programs distinguish between "text" and "binary" files?
Before we answer this question, let us first try to come up with a definition. Clearly, on a fundamental file-system level, every file is just a collection of bytes and could therefore be viewed as binary data. On the other hand, a distinction between "text" and "non-text" (hereafter: "binary") data seems helpful for programs like `grep` or `diff`, if only to avoid messing up the output of your terminal emulator.
So maybe we can start by defining "text" data. It seems reasonable to begin with an abstract notion of text as a sequence of Unicode code points. Examples of code points are characters like `k`, `ä` or `א`, as well as special symbols like `%`, `☢` or `🙈`. To store a given text as a sequence of bytes, we need to choose an encoding. If we want to be able to represent the whole Unicode range, we typically choose UTF-8, sometimes UTF-16 or UTF-32. Historically, encodings which support just a part of today's Unicode range are also important. The most prominent ones are US-ASCII and Latin-1 (ISO 8859-1), but there are many more. And all of these look different on a byte level.
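As a quick illustration, here is a sketch in Rust of how the single code point U+00E4 ("ä") is represented in four encodings. Rust string literals are always UTF-8, so the byte sequences for the other encodings are spelled out by hand:
```rust
fn main() {
    // U+00E4 ("ä") looks different in each encoding:
    assert_eq!("ä".as_bytes(), [0xc3, 0xa4]);  // UTF-8: two bytes
    let _latin1 = [0xe4u8];                    // Latin-1 (ISO 8859-1): one byte
    let _utf16le = [0xe4u8, 0x00];             // UTF-16, little-endian: two bytes
    let _utf32le = [0xe4u8, 0x00, 0x00, 0x00]; // UTF-32, little-endian: four bytes
}
```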
Given just the contents of a file (not the history of how it was created), we can therefore try the following definition:
A file is called "text file" if its content consists of an encoded sequence of Unicode code points.
There are two practical problems with this definition. First, we would need a list of all possible encodings. Second, in order to test whether the content of a file is encoded in a given encoding, we would have to decode the whole file and see if that succeeds¹. The whole process would be really slow.
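To sketch what this "decode and see if it succeeds" test would look like for a single encoding, here is a minimal Rust version using the standard library's UTF-8 validation. A full implementation of the definition above would have to repeat this for every supported encoding:
```rust
/// A file's content is "text" (in UTF-8) if decoding it succeeds.
fn is_utf8_text(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}

fn main() {
    // "hello 🌍" is valid UTF-8:
    assert!(is_utf8_text("hello 🌍".as_bytes()));
    // 0x89, the first byte of a PNG file, can never start a UTF-8 sequence:
    assert!(!is_utf8_text(&[0x89, 0x50, 0x4e, 0x47]));
}
```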
It turns out that there is a much faster way to distinguish between text and binary files, but it comes at the cost of precision.
To see how this works, let's go back to our two candidate files and explore their byte-level content. I am using `hexyl` as a hex viewer, but you can also use `hexdump -C`:
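```bash
▶ hexdump -C message
00000000  68 65 6c 6c 6f 20 f0 9f  8c 8d 0a                 |hello .....|
0000000b

▶ hexdump -C white
00000000  89 50 4e 47 0d 0a 1a 0a  00 00 00 0d 49 48 44 52  |.PNG........IHDR|
…
```
(The dump of `white` is truncated here: its first sixteen bytes are fixed by the PNG format, but the remaining bytes depend on the ImageMagick version used to create the file.)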
Note that both files contain bytes within and outside of the ASCII range (`00`…`7f`). The four bytes `f0 9f 8c 8d` in the `message` file, for example, are the UTF-8 encoded version of the Unicode code point U+1F30D (🌍). On the other hand, the bytes `50 4e 47` at the beginning of the `white` image are a simple ASCII-encoded version of the characters `PNG`².
So clearly, looking at bytes outside the ASCII range cannot be used as a method to detect "binary" files. However, there is a difference between the two files: the image file contains a lot of NULL bytes (`00`), while the short text message does not. It turns out that this observation can be turned into a simple heuristic for detecting binary files, since a lot of encoded text data does not contain any NULL bytes (even though they would be perfectly legal).
In fact, this is exactly what `diff` and `grep` use to detect "binary" files. The following macro is included in `diff`'s source code (`src/io.c`):
```c
#define binary_file_p(buf, size) (memchr (buf, 0, size) != 0)
```
Here, the `memchr(const void *s, int c, size_t n)` function is used to search the initial `size` bytes of the memory region starting at `buf` for the character `0`. To speed this process up even more, typically only the first few bytes of the file (e.g. 1024 bytes) are read into the buffer `buf`. To summarize, `grep` and `diff` use the following heuristic approach:
A file is very likely to be a "text file" if the first 1024 bytes of its content do not contain any NULL bytes.
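Translated into a small self-contained program, the heuristic could look like the following sketch in Rust. This is only an illustration of the idea, not `diff`'s actual implementation; the file names are the ones from the examples above:
```rust
use std::fs::File;
use std::io::Read;

/// A file is likely "text" if the first 1024 bytes of its content
/// do not contain any NULL bytes.
fn is_probably_text(path: &str) -> std::io::Result<bool> {
    let mut buf = [0u8; 1024];
    // A single read is enough for a sketch; it returns at most 1024 bytes.
    let n = File::open(path)?.read(&mut buf)?;
    Ok(!buf[..n].contains(&0))
}

fn main() -> std::io::Result<()> {
    println!("message: {}", is_probably_text("message")?); // true
    println!("white:   {}", is_probably_text("white")?);   // false
    Ok(())
}
```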
Note that there are counterexamples where this fails. For example, even if unlikely, UTF-8-encoded text can legally contain NULL bytes. Conversely, some particular binary formats (like binary PGM) do not contain NULL bytes. This method will also typically classify UTF-16 and UTF-32 encoded text as "binary", as they encode common Latin-1 code points with NULL bytes:
```bash
▶ iconv -f UTF-8 -t UTF-16 message > message-utf16
▶ hexdump -C message-utf16
00000000  ff fe 68 00 65 00 6c 00  6c 00 6f 00 20 00 3c d8  |..h.e.l.l.o. .<.|
00000010  0d df 0a 00                                       |....|
00000014

▶ grep . message-utf16
Binary file message-utf16 matches
```
Nevertheless, this heuristic approach is very useful. I have written a small library in Rust which uses a slightly refined version of this method to quickly determine whether a given file contains "binary" or "text" data. It is used in my program `bat` to prevent "binary" files from being dumped to the terminal.
Footnotes
¹ Note that there are some encodings that write so-called byte order marks (BOMs) at the beginning of a file to indicate the type of encoding. For example, the little-endian variant of UTF-32 uses `ff fe 00 00`. These BOMs would help with the second point because we would not need to decode the whole content of the file. Unfortunately, adding BOMs is optional and a lot of encodings do not specify one.
² `50 4e 47` is part of the magic number of the PNG format. Magic numbers are similar to BOMs, and a lot of binary formats use magic numbers at the beginning of the file to signal their type. Using magic numbers to detect certain types of "binary" files is a method that is used by the `file` tool.
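Since both BOMs and magic numbers are fixed byte sequences at the start of a file, detecting them boils down to a simple prefix comparison. A minimal sketch (not how the `file` tool is actually implemented):
```rust
fn main() {
    // The first bytes read from a hypothetical file:
    let contents: &[u8] = &[0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a];

    let png_magic = [0x89u8, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a];
    let utf32le_bom = [0xffu8, 0xfe, 0x00, 0x00];

    println!("PNG?       {}", contents.starts_with(&png_magic));   // true
    println!("UTF-32 LE? {}", contents.starts_with(&utf32le_bom)); // false
}
```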
Top comments (7)
It's a good, informative article but most people have no clue what a "Unicode code point" is. Instead, I'd make the following distinction:
A text file consists of plain, unformatted words, letters and punctuation intended to be readable by humans. In a text file, every 8- or 16-bit "code" corresponds with exactly one letter, number or punctuation mark.
A binary file consists of complex structured data meant primarily to be read by applications that translate those structures into something useful by humans (pictures, audio, video, richly formatted text, etc).
Still, I imagine that 99% of computer users these days (except developers) never deal with text files directly. Almost all content people care about lives in binary files. The exceptions to that rule are some office documents in XML or RTF format that, while they might technically be text documents, are so densely coded and packed with syntax that they might as well be considered binary.
Thank you for the feedback. See answers on Reddit: reddit.com/r/programming/comments/...
It's useful to analyse what similar libraries do to check if a file is binary:
github.com/search?q=isBinary
Hi David, fantastic post and explanation!
BTW I absolutely love bat, I aliased it to cat months ago :-D
Thank you, glad you liked it!
Good post. So basically
text file is: `00 7A CA 8S 0F DE SO`
binary file is: `00 01 00 01 00 10`
and if a file contains more `00` than usual, it's considered binary :P
Good point. I have renamed the article. Thank you.