Tomasz Wegrzanowski

Posted on Sep 8, 2021

Electron Adventures: Episode 43: File Types

#ruby

We spent a lot of time experimenting with frameworks. Let's pause that for a moment and instead add some functionality to our file manager.

File types

A lot of functionality like viewing or editing or executing files, or even displaying them in a nice way with proper icons, will depend on us knowing "type" of the file.

On the web we don't need to worry about that, as HTTP responses come with Content-Type header, which is hopefully correct. And we rarely do anything dynamic based on response type, just handing all responsibility over to the browser.

For local files, we need to do our work.

File extensions

Traditional way of determining file type is to use file extension. For example, .txt is a text file, .jpg is a picture, .mp3 is a music file, etc.

The best thing about it is that you don't need to inspect the file contents to determine the type. If it's called kitten17.jpg, it's a picture.

This is a good start. Unfortunately that's not a complete solution, and we need to deal with issues like:

in addition to name, we need to check other file properties, like is it directory, is it symlink etc.
a lot of files have no extension, like executables on Unix systems
because there are only so many possible file extensions, it's common for multiple unrelated formats to share them
some files have double extensions, like .tar.gz
some files like XML can be crazy many different things - fortunately people mostly use more specific extensions, like SVG XML files will generally be .svg not .xml
in certain sense directories can also have "types", like .git, node_modules, *.app, and might require special handling

Looking inside files

An alternative to file extension is to look inside the file.

Most binary file formats have specific bytes at some specific position - usually at start of file, occasionally at the end. Wikipedia has some examples, but there's a lot more.

The main downside of doing this is that we need to read at least some contents of the file. For most formats reading the first block ("block" being 512 or 1024 or such bytes - files are stored this way on disk so we don't need to worry exactly how many bytes to read - it will generally get a miminum of one block) will do.

It's no big deal if user asked to view one specific file, but if we want to know types of 1000 files in a directory just to put proper icons next to them, reading the first block of each file is going to have significant performance cost.

`file` command

Unix systems come with file program that does it for us, either in human readable form, or as mime type. Unfortunately it's not very good:

```$ shell
$ file *
index.js: ASCII text
node_modules: directory
package-lock.json: JSON data
package.json: JSON data
preload.js: ASCII text
public: directory
rollup.config.js: Java source, ASCII text
src: directory
$ file --mime-type *
index.js: text/plain
node_modules: inode/directory
package-lock.json: application/json
package.json: application/json
preload.js: text/plain
public: inode/directory
rollup.config.js: text/x-java
src: inode/directory




`file` only looks at file contents not name, and why it got an idea that `rollup.config.js` is Java not JavaScript, while `index.js` is plaintext, is anyone's guess.

And obviously it's not available on Windows.

### `#!` shebang

For text-based executables, there's a special `#!` shebang line at the start of the file. It tells the operating system which language to use to run the file.

It requires some mapping, for example these are just some of the real `#!` lines from my computer, all containing various Python programs:

* `#!/usr/bin/env python`
* `#!/usr/bin/python2`
* `#!/usr/local/Cellar/awscli/2.2.14/libexec/bin/python3.9`
* `#!/usr/local/Cellar/python@3.9/3.9.6/Frameworks/Python.framework/Versions/3.9/bin/python3.9`
* `#!/usr/local/opt/python/bin/python3.7`
* `#!/usr/local/opt/python@3.9/bin/python3`
* `#!/usr/local/opt/python@3.9/bin/python3.9`

`file` does an OK job with that.

### Algorithhm for detecting file type

As none of these methods are sufficient on their own, we need to combine them:

* first we check file stats - is it a directory, is it a symlink, is it a regular file etc.
* then we check file extension
* only then if extension is ambiguous or missing, we check file contents, either by reading the first block of the file, or running `file` command

And then it might be a good idea to cache that information, as it's quite expensive to calculate, and file types don't just change all the time.

However even then, we're not done yet. Once we detect file type, we need some list of what to do with each types, so how to show them, view them, or edit them, or execute them, and so on.

### Code for today

There's plenty of libraries for detecting file type, but let's write one ourselves.

I'll do it in Ruby, as command line Javascript is somewhat tedious.

And I'm not even attempt to be comprehensive here, just do something that establishes pattern we could use to write a proper file type library.

Any real file detector would have a list of at least a few hundred file types.

### Iterate for every argument

We'll get list of paths to check from command line. Let's instantiate our `FileType` class for each of them, then `.call` it.



```ruby
require "pathname"

class FileType
  attr_reader :path

  def initialize(path)
    @path = Pathname(path)
  end

  ...

  def call
    puts "#{path}: #{type}"
  end
end

ARGV.each do |path|
  FileType.new(path).call
end

Deal with symlinks

The first thing we need to do is decide what to do with symlinks. I'll just add extra "(symlink)" to end of the type.

  def type
    if @path.symlink?
      "#{base_type} (symlink)"
    else
      base_type
    end
  end

Type from file stat

The next step is dealing with directories. We should also support nonexistent files, and various operating system specific special files (like block devices, character devices, sockets, and pipes).

  def base_type
    return "inode/directory" if @path.directory?
    type_from_extension or type_from_contents
  end

Type from extension

First, if we can use file extension to figure out the type, without reading what's inside, we should definitely do that.

  def type_from_extension
    case @path.extname.downcase
    when ".png"
      "image/png"
    when ".jpg", ".jpeg"
      "image/jpg"
    when ".js", ".jsx"
      "text/javascript"
    when ".json", ".jsx"
      "application/json"
    when ".md"
      "text/markdown"
    when ".txt"
      "text/plain"
    end
  end

Of course this part would need to be a lot longer to come even close to covering enough file types.

Type from interpretter

Extension is not of much help? Time to check the contents.

This step is sort of optional, but file is pretty bad at #! files, and I'm not even sure why, as it's quite easy.

The idea is that these paths are typically #!/some/path/<interpretter><version> <arguments>, and we just want to get the intepretter.

Alternatively they're #!/usr/bin/env <interpretter><version> <arguments>, so if interpretter is env, we just keep going.

  def first_line
    @first_line ||= @path.open(&:readline)
  end

  def interpretter
    @interpretter ||= begin
      parts = first_line[2..-1].split
      command = parts[0].split("/").last
      command = parts[1].split("/").last if command == "env"
      command
    end
  end

  def type_from_shebang
    case interpretter
    when /\Aruby/
      "text/ruby"
    when /\Apython/
      "text/python"
    when /\Anode/
      "text/javascript"
    when /\A(sh|bash)/
      "text/bash"
    when /\Aperl/
      "text/perl"
    else
      "text/x-#{interpretter}"
    end
  end

  def type_from_contents
    if first_line[0, 2] == "#!"
      type_from_shebang
    else
      `file -b --mime-type #{@path.to_s.shellescape}`.chomp.split.first
    end
  end

And finally if we can't figure it ourselves we call file command. Unfortunately OSX file often adds some extra crap at the end even if we ask for just mime type (-b --mime-type), so just keep the first word of its output.

Result

Here's the results:

For something so simple, it's working pretty decently. In the next episode, we'll add file type icons to our file manager.

As usual, all the code for the episode is here.

DEV Community

Electron Adventures: Episode 43: File Types

File types

File extensions

Looking inside files

`file` command

Deal with symlinks

Type from file stat

Type from extension

Type from interpretter

Result

Top comments (0)

Read next

Did you know there’s a state of the web report?

Amazon Q Developer Tips: No.21 Amazon Q Developer Agents - /test

Unleash the Power of Your Links (UrlHub)

Vercel Next.js Integration

File types

File extensions

Looking inside files

file command

Deal with symlinks

Type from file stat

Type from extension

Type from interpretter

Result

Read next

Did you know there’s a state of the web report?

Amazon Q Developer Tips: No.21 Amazon Q Developer Agents - /test

Unleash the Power of Your Links (UrlHub)

Vercel Next.js Integration

`file` command