
Everything is a File

Andrew (he/him) ・ 4 min read

One of the defining features of UNIX and UNIX-like operating systems is that "everything is a file" [1].* Documents, directories, links, input and output devices, and so on are all just sinks for, or sources of, streams of bytes from the point of view of the OS [2]. You can verify that a directory is just a special kind of file with the command:

$ view /etc

...which will show the contents of the /etc file (directory). Note that view is shorthand for running vi in read-only mode, vi -R [7], so you can exit it by typing :q!. Be extremely careful when opening directories as files -- you can cause serious damage to your system if you accidentally edit something you shouldn't have. On a Linux file system, the /dev directory contains files that represent devices, /proc contains special files that represent process and system information, and so on:

$ ls -l /dev/std*
lrwxrwxrwx 1 root 15 Nov  3  2017 /dev/stderr -> /proc/self/fd/2
lrwxrwxrwx 1 root 15 Nov  3  2017 /dev/stdin -> /proc/self/fd/0
lrwxrwxrwx 1 root 15 Nov  3  2017 /dev/stdout -> /proc/self/fd/1

$ cat /proc/uptime  # shows current system uptime
36710671.21 1127406622.14

$ cat /dev/random  # returns randomly-generated bytes
{IGnh▒I侨▒▒Ұ>j▒L▒▒%▒=▒▒U@▒Q▒2▒;▒l▒q▒$▒r▒1▒U񝾎...

Other common /dev files include /dev/zero which produces a constant stream of zeros, and /dev/null, which accepts all input and does nothing with it (think of it like a rubbish bin) [8]. Note that "files" can show the same information each time you open them, or they may be constantly changing -- reflecting the current state of the system, or displaying the results of some constantly-running process.
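Both of these devices are easy to play with from the shell. A minimal sketch (the 16-byte count is arbitrary):

```shell
# Take a fixed slice of the endless stream of zeros and count it:
head -c 16 /dev/zero | wc -c    # prints 16

# Anything written to /dev/null simply disappears:
echo "unwanted output" > /dev/null    # prints nothing
```

Because both are ordinary file names, any program that reads or writes files can use them without knowing they are devices at all.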

But why is everything a file in UNIX?

As Dennis Ritchie and Ken Thompson outlined in their 1974 Communications of the ACM paper, "The UNIX Time-Sharing System", there are three main advantages to the "everything is a file" approach:

  1. file and device I/O are as similar as possible
  2. file and device names have the same syntax and meaning, so that a program expecting a file name as a parameter can be passed a device name
  3. special files [/dev/random, etc.] are subject to the same protection mechanism as regular files
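Point 2 is easy to check in practice: any program expecting a file name will happily accept a device name instead. A quick sketch with wc (the file name greeting is just an example):

```shell
# wc -c counts bytes in whatever it is given to read;
# a device file behaves exactly like a regular file here:
printf 'hello' > greeting
wc -c < greeting     # 5  (a regular file)
wc -c < /dev/null    # 0  (a device file: the empty byte stream)
rm greeting
```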

The advantages of this approach are illustrated in this 1982 Bell Labs video, in which Brian Kernighan, co-creator of the AWK programming language and a long-time contributor to UNIX, describes how programs can be pipelined to reduce unnecessary code repetition (through modularisation) and increase flexibility (jump to 05:30):


$ makewords sentence | lowercase | sort | unique | mismatch

The code Kernighan types is reproduced above. In it, makewords splits the text file sentence into words (separated by whitespace characters) and returns one word per line. makewords was passed a file as input, and while its output would normally be sent to the terminal, we've piped it (using |) as input to the next program, lowercase. lowercase converts all uppercase characters on each line to lowercase, then pipes its output to sort, which sorts the list of words alphabetically. sort pipes its output to unique, which removes duplicate words from the list and sends its output to mismatch, which checks the list of unique, all-lowercase words against a dictionary file. Any misspelled words (words not appearing in the dictionary) are then printed to the terminal by default. By connecting these five separate programs, we've easily created a brand-new program which spell-checks a file.
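The programs in Kernighan's demo (makewords, lowercase, unique, mismatch) aren't shipped with modern systems, but the first four stages can be sketched with standard tools. A rough equivalent, using a small made-up input file:

```shell
# Create a small input file to play with:
printf 'The quick brown Fox jumps over the lazy dog The\n' > sentence

tr -s '[:space:]' '\n' < sentence |   # makewords: one word per line
  tr '[:upper:]' '[:lower:]' |        # lowercase: normalise case
  sort |                              # sort: alphabetical order
  uniq                                # unique: drop duplicates

rm sentence
```

The final mismatch step could be approximated by comparing this output against a sorted word list (for example with comm -23 and a dictionary file such as /usr/share/dict/words, where available).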

Note that input can come from regular files, but also from special files like disks and from other input devices like the terminal itself. Output can be sent to the terminal, to other programs, or written to disk as files. This ability to pipeline programs together, treating files, disks, and special I/O devices identically, greatly increases the power and flexibility of UNIX relative to systems which treat these things differently from one another.


* This is more correctly written as "everything is a file descriptor or a process" [3]. (File descriptors are also sometimes called "handles".) File descriptors are simply non-negative indices which refer to files, directories, input and output devices, etc. [5] The file descriptors of stdin, stdout, and stderr are 0, 1, and 2, respectively. This is why, when we want to suppress error output, we redirect it (with the redirection operator > [4]) to /dev/null with

$ command 2>/dev/null

We can send all output (stdout and stderr) of a particular command as input to another command with 2>&1 |, or, more simply, with bash's |& shorthand [6]:

$ command1 2>&1 | command2
$ command1 |& command2
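A quick way to see these redirections behave as described (the path below is deliberately nonexistent):

```shell
# ls complains on stderr (fd 2); sending fd 2 to /dev/null
# silences the error without touching stdout (fd 1):
ls /no/such/path 2>/dev/null          # prints nothing

# With 2>&1, the error joins stdout and flows through the pipe:
ls /no/such/path 2>&1 | wc -l         # counts the single error line
```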

The OS maps file descriptors to files (the actual locations of the bytes on disk) using the inode (index node) [10].
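One way to see inodes directly is ls -i, which prefixes each file with its inode number. A hard link is simply a second name attached to the same inode (the file names below are just examples):

```shell
touch original
ln original alias      # a hard link: a second name, same inode
ls -i original alias   # both lines show the same inode number
rm original alias
```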

Processes [9] are separate from file descriptors. A process is an instance of a program which is currently being executed. A process contains an image (read-only copy) of the code to be executed, its own stack and heap space to use during execution, its open file descriptors, and so on. Processes also have their own separate indexing system (process ids, or pids).
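In the shell, $$ expands to the current shell's pid, and on Linux that process is itself exposed as files under /proc, tying us right back to "everything is a file" (the /proc paths below are Linux-specific):

```shell
echo $$             # the current shell's process id

# On Linux only:
ls /proc/$$/fd      # the file descriptors this shell holds open
cat /proc/$$/comm   # the name of the program behind this pid
```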


Related:

Introduction to Linux (2008), Machtelt Garrels

An Introduction to UNIX Processes (2014), Brian Storti

Ghosts of UNIX Past: A Historical Search for Design Patterns (2010), Neil Brown

unix-history-repo (continuous UNIX commit history from 1970 until today)


Discussion


Great article. Kernighan's example reminds me of Doug McIlroy's code review of Donald Knuth where he replaces a large Pascal program with a shell one-liner.

The related section is very interesting.

 

Within the unix-history-repo, there's a link to this visualisation, which I think is pretty cool.

 

For anyone interested, the visualization is created using Gource.