A hash function converts a variable-size sequence of bytes (a string, the content of a file, etc.) into a fixed-size sequence of bytes, called a digest. Hashing the same file always returns the same digest, whatever the length of the file, and with a good hash function it is practically infeasible to find two different inputs that produce the same digest. It's a sort of digital fingerprint, usually represented as a hexadecimal string between 32 and 128 characters long.
The hash of a file is useful, for example, to check whether two files have identical content, or whether the content was corrupted during a download.
There are different hash functions (MD5, SHA-1, SHA-2, SHA-3, etc.), many of them available in Elixir.
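To make the "fixed size" part concrete, here is a small sketch that hashes the same long input with a few algorithms and prints the digest size in bytes and the length of its hexadecimal form. It assumes these algorithm atoms are enabled in your Erlang/OpenSSL build (MD5, for instance, may be disabled on some hardened setups).

```elixir
# The digest size depends only on the algorithm, never on the input length.
# Assumption: :md5, :sha, :sha256 and :sha512 are all enabled in this build.
input = String.duplicate("I love Elixir! ", 1_000)

sizes =
  for algorithm <- [:md5, :sha, :sha256, :sha512] do
    digest = :crypto.hash(algorithm, input)
    hex = Base.encode16(digest, case: :lower)
    # {algorithm, digest size in bytes, hex string length}
    {algorithm, byte_size(digest), String.length(hex)}
  end

IO.inspect(sizes)
```

MD5 sits at the 32-character end of the range and SHA-512 at the 128-character end.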
Update 👨💻: I had initially written the examples below using the MD5 algorithm (the weakest in the list), just because I thought it was the fastest one. @Hauleth pointed out that SHA-1 and SHA-256 should be faster on newer CPUs thanks to the Intel SHA Extensions, so I rewrote the examples using SHA-256.
Hashing a string
Let's start by hashing a string using the SHA-256 algorithm.
iex> :crypto.hash(:sha256,"I love Elixir")
<<164, 35, 167, 235, 69, 224, 253, 77, 180, 92, 77, 172, 37,...>>
iex> :crypto.hash(:sha256,"I love Elixir!")
<<209, 119, 188, 230, 168, 124, 98, 212, 119, ...>>
We use the hash/2 function in Erlang's :crypto module. The first argument is the name of the hash algorithm we want to use, in this case :sha256; the second argument is the sequence of bytes we want to hash, in this case a string. It returns the digest as a sequence of bytes (a binary).
Notice how the output changes completely just by appending a "!" character.
We can use Base.encode16/1 to get the hexadecimal string representation:
iex> :crypto.hash(:sha256,"I love Elixir!") \
...> |> Base.encode16() \
...> |> String.downcase()
"d177bce6a87c62d4772f404fcad2f8c2d9606c04f99942b71d7c521eb79c4c3b"
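As a small variation, Base.encode16/2 also accepts a `:case` option, so the lowercase string can be produced in a single step. The sketch below compares both ways; the variable names are only for illustration.

```elixir
digest = :crypto.hash(:sha256, "I love Elixir!")

# One step: ask Base.encode16/2 for lowercase hex directly.
hex_one_step = Base.encode16(digest, case: :lower)

# Two steps: uppercase hex first, then String.downcase/1.
hex_two_steps = digest |> Base.encode16() |> String.downcase()

IO.puts(hex_one_step)
IO.inspect(hex_one_step == hex_two_steps)
```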
If you are on a Linux or Mac machine, you can use a command-line tool like sha256sum (or shasum -a 256, which ships with macOS) to check that the digests match:
$ echo -n 'I love Elixir!' | sha256sum
d177bce6a87c62d4772f404fcad2f8c2d9606c04f99942b71d7c521eb79c4c3b -
Hashing a file
Calculating the hash of a file is conceptually the same as calculating the hash of a string. A file is a sequence of bytes, so we could use the same :crypto.hash(:sha256, file_content_binary) function. But we saw that, most of the time, it is not a good idea to load the whole file into memory!
We can use File.stream! and a different set of functions available in :crypto to read and process a file in chunks.
Let's first see a simple example using the same string as before, divided into two chunks:
iex> [chunk_1, chunk_2] = ["I love ", "Elixir!"]
iex> hash_ref = :crypto.hash_init(:sha256)
#Reference<...36636>
iex> hash_ref = :crypto.hash_update(hash_ref, chunk_1)
#Reference<...36647>
iex> hash_ref = :crypto.hash_update(hash_ref, chunk_2)
#Reference<...36655>
iex> digest = :crypto.hash_final(hash_ref)
<<209, 119, 188, 230, 168, 124, 98, 212, 119, ...>>
iex> digest |> Base.encode16() |> String.downcase()
"d177bce6a87c62d4772f404fcad2f8c2d9606c04f99942b71d7c521eb79c4c3b"
Processing the sequence in chunks, we get the same result as before.
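Where we cut the chunks makes no difference to the final digest. As a quick check, this sketch splits the string at every possible position (including empty chunks at the two ends) and compares each incremental digest with the one-shot hash. `hash_chunks` is just an illustrative helper name.

```elixir
message = "I love Elixir!"
expected = :crypto.hash(:sha256, message)

# Hash a list of chunks incrementally with hash_init/update/final.
hash_chunks = fn chunks ->
  chunks
  |> Enum.reduce(:crypto.hash_init(:sha256), fn chunk, ref ->
    :crypto.hash_update(ref, chunk)
  end)
  |> :crypto.hash_final()
end

# Try every two-way split of the message, from 0 bytes up to the full length.
all_match? =
  Enum.all?(0..byte_size(message), fn split_at ->
    head = binary_part(message, 0, split_at)
    tail = binary_part(message, split_at, byte_size(message) - split_at)
    hash_chunks.([head, tail]) == expected
  end)

IO.inspect(all_match?)
```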
The same approach works for a file: we reduce over the file stream, updating the hash with each chunk.
hash_ref = :crypto.hash_init(:sha256)

File.stream!(file_path)
|> Enum.reduce(hash_ref, fn chunk, prev_ref ->
  new_ref = :crypto.hash_update(prev_ref, chunk)
  new_ref
end)
|> :crypto.hash_final()
|> Base.encode16()
|> String.downcase()
- We get a hash reference from :crypto.hash_init(:sha256), which is passed to Enum.reduce as the initial accumulator.
- We use Enum.reduce to read each chunk from the file and add it to the calculation. :crypto.hash_update/2 returns a new reference, which becomes the new accumulator.
- Once all the chunks have been processed, the final reference is piped into the :crypto.hash_final/1 function, which returns the SHA-256 digest of the file.
We can write the reduce function in a nicer and more compact way:
File.stream!(file_path)
|> Enum.reduce(:crypto.hash_init(:sha256), &(:crypto.hash_update(&2, &1)))
|> :crypto.hash_final()
|> Base.encode16()
|> String.downcase()
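If you need this in more than one place, the pipeline can be wrapped in a small module. This is only a sketch: `FileHash`, `sha256/1` and `same_content?/2` are hypothetical names, and the temp files exist just to make the example runnable.

```elixir
defmodule FileHash do
  # Hypothetical helper module, not part of the original post.

  # Returns the lowercase hex SHA-256 digest of the file at `path`,
  # streaming it with File.stream!/1 instead of loading it whole.
  def sha256(path) do
    path
    |> File.stream!()
    |> Enum.reduce(:crypto.hash_init(:sha256), &(:crypto.hash_update(&2, &1)))
    |> :crypto.hash_final()
    |> Base.encode16(case: :lower)
  end

  # True when both files have the same SHA-256 digest.
  def same_content?(path_a, path_b), do: sha256(path_a) == sha256(path_b)
end

# Usage with throwaway files in the system temp directory.
tmp = System.tmp_dir!()
path_a = Path.join(tmp, "file_hash_a.txt")
path_b = Path.join(tmp, "file_hash_b.txt")
path_c = Path.join(tmp, "file_hash_c.txt")
File.write!(path_a, "I love Elixir!\nline two\n")
File.write!(path_b, "I love Elixir!\nline two\n")
File.write!(path_c, "I love Elixir!\nline 2\n")

IO.inspect(FileHash.same_content?(path_a, path_b))
IO.inspect(FileHash.same_content?(path_a, path_c))
```

Comparing digests like this is the "are these two files identical?" use case mentioned at the beginning.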
File.stream! chunks vs lines
By default, File.stream! emits lines instead of fixed-size chunks. Emitting lines is slower than emitting chunks, I think because the stream has to look for newlines to split the data into lines.
To force the stream to emit chunks, we use File.stream!/3:
iex> File.stream!(file_path, [], 2_048)
%File.Stream{
line_or_bytes: 2048,
modes: [:raw, :read_ahead, :binary],
path: file_path,
raw: true
}
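Here we set a chunk size of 2048 bytes. Note that since Elixir 1.16 the preferred argument order is File.stream!(file_path, 2_048, []), and the order shown here is deprecated.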
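Putting chunked streaming and hashing together could look like the sketch below. `ChunkedHash` is a hypothetical module name, and the version check is my own addition: it is there only because, to my knowledge, Elixir 1.16 moved the chunk size before the modes in File.stream!/3.

```elixir
defmodule ChunkedHash do
  # Hypothetical helper module, not part of the original post.

  # Returns the lowercase hex SHA-256 digest of the file at `path`,
  # reading it in chunks of `chunk_size` bytes.
  def sha256(path, chunk_size \\ 2_048) do
    path
    |> chunk_stream(chunk_size)
    |> Enum.reduce(:crypto.hash_init(:sha256), &(:crypto.hash_update(&2, &1)))
    |> :crypto.hash_final()
    |> Base.encode16(case: :lower)
  end

  # Assumption: from Elixir 1.16 on, the chunk size comes before the modes,
  # so we pick the argument order that matches the running version.
  defp chunk_stream(path, chunk_size) do
    if Version.match?(System.version(), ">= 1.16.0") do
      File.stream!(path, chunk_size, [])
    else
      File.stream!(path, [], chunk_size)
    end
  end
end

# Usage with a throwaway file of 14,000 bytes.
path = Path.join(System.tmp_dir!(), "chunked_hash_example.bin")
File.write!(path, :binary.copy("I love Elixir!", 1_000))

IO.puts(ChunkedHash.sha256(path))
```

The chunk size only affects how the file is read, not the resulting digest, so you can tune it freely.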
I made a quick benchmark (you can find it in this gist), which shows that streaming chunks is faster and also uses less memory.
Name             ips        average  deviation         median         99th %
chunks       23.29 K       42.93 μs    ±63.44%       41.98 μs       83.98 μs
lines         9.21 K      108.54 μs    ±42.52%       93.98 μs      275.98 μs

Comparison:
chunks       23.29 K
lines         9.21 K - 2.53x slower +65.61 μs

Memory usage statistics:

Name      Memory usage
chunks         2.11 KB
lines         20.84 KB - 9.88x memory usage +18.73 KB
Wrap up
We've seen what a hash function is and how to easily calculate the hash of a file using Elixir.
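In the past (and unfortunately, I think, still in the present 😅), general-purpose hash functions like these were used to store passwords in the database. They are designed to be fast, which is exactly what you don't want for passwords. If you need to securely handle and store passwords, please use the bcrypt_elixir library!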