DEV Community

Cover image for From 'file' to 'life' - diffing in biology
Danie Palm
Danie Palm

Posted on • Updated on

From 'file' to 'life' - diffing in biology

This is the second post in the Elixir for Life series in which I'm writing about all the components required to build an artificial life system (in Elixir and Joy). In this post we explore some algorithms used to compare, respectively, code and DNA. In a future post, I hope to arrive at a point where the distinction between code and DNA disappears. Or where file has become (artificial) life.

Code

The Unix diff command has been around since 1970. It is used to compare two versions of the same file or two related files. The output being all the lines deemed to have been inserted, deleted, or changed.

Perhaps surprizingly there is no such thing as a canonical diff output. Various diff algorithms exist. If any one of them (assuming they are all correct implementations) is used to produce a diff between two files, the resulting patch, when applied to the first file, will yield the second file. But the patches or diffs produced by different algorithms need not all be identical. Different diff algorithms optimize for different outcomes: such as, for instance, keeping insertions together, or applying different weightings to whitespace changes.

The most frequently used diff algorithms are arguably those implemented by the git diff command. The default used by git diff is the Myers algorithm, which was developed around 1986. But a lot of progress has been made since Myers, both in terms of efficiency and utility of the output. Useful and readable diff output is critical when it comes to code review processes that commonly take the form of pull requests. The most useful outputs are those that give the reviewer a clearer and more concise idea of the changes made by another developer.

Research suggests that the histogram algorithm is both faster and more useful in code review and code base evolution research than the other algorithms supported by git diff. So go ahead and set it as your global git default diff algorithm:

git config --global diff.algorithm histogram

Biology

Molecular biologists, on the other hand, are interested in diffing (or aligning) related sequences of polymers in order to identify subsequences that were inserted, deleted, match exactly, or match approximately.

There are two major types of sequences that are amenable to alignment:

  • nucleic acids, and
  • proteins

Nucleic acids, like DNA or RNA, are single or double stranded sequences of nucleotides (the the well known A, G, C, and T nitrogen bases). Whereas proteins are sequences of amino acids. Nucleic acid sequences and protein sequences are related in that the former is translated to the latter according to the genetic code.

Much like commits and merging alter a software codebase, mutations and crossing-over alter the nucleotide sequences of DNA molecules and indirectly the amino acid sequences of the proteins that are translated from DNA.

By comparing or aligning biological sequences, biologists can determine how closely two sequences or even organisms are related. It allows biologists to construct phylogenetic trees, showing how all known life forms are related all the way to the Last Universal Common Ancestor (LUCA) and even beyond to older life forms with no modern-day descendants.

The algorithms used to align biological sequences have strong ties to their computer science counterparts and appeared at around the same time. The Needleman-Wunsch algorithm is used to find an end-to-end or global alignment between two sequences, whereas the Smith-Waterman algorithm finds the two subsequences, one from each parent sequence, that provide the best local alignment. Both these algorithms involve finding insertions, deletions and matches, or close-enough matches.

Below is a possible global alignment between two DNA sequences, GCATGCU and GATTACA (taken from Wikipedia):

GCATG-CU
| ||| ||
G-ATTACA

The first two elements are exact matches, followed by a deletion in the second string, two exact matches, a good-enough match, an insertion in the second sequence, an exact match, and a final good-enough match. There are many other possible alignments for these two particular sequences. Given a predetermined set of penalties for the introduction of insertions, deletions or mismatches, the Needleman-Wunsch algorithm calculates the best possible alignment, sometimes arriving at several equally good outcomes. The choice of penalties is informed by biology, but there is no one-size-fits-all scheme.

Code

Let's explore the details of the Needleman-Wunsch algorithm and how we can implement it in Elixir.

Consider the two single-element sequences A and G. In order to align these two sequences, we start with an empty alignment and then extend it in three possible ways:

  • take one element (A) from sequence one and a gap (-) from sequence two
  • take a gap (-) from sequence one and one element (G) from sequence two
  • take one element (A) from sequence one and one (G) from sequence two

This yields the following alignments (in the same order):

A          -          A
                      |
-          G          G

Note, however, that the first two alignments are incomplete in that each only uses an element from one sequence. Both of them can undergo one more round of extension, yielding three end-to-end alignments:

A-          -A          A
                        |
-G          G-          G

We now need a scoring scheme to help us pick the optimal alignment from the above three candidates. To do so, we introduce the concepts of gap penalties and similarity scores. Similarities are typically represented as numerical values in a symmetric matrix with rows and columns for each of the possible elements of a sequence. To keep things simple we calculated the similarity with this function instead:

# This function clause will match if the two arguments are identical
defp similarity_value(x, x), do: 1

# This function clause will match for any two arguments, except for
# identical arguments which will match the previous clause
defp similarity_value(_x, _y), do: -1

The corresponding similarity matrix is one where the main diagonal elements are equal to 1 and where all other elements are equal to -1. Finally, we assume a penalty of -1 whenever a gap occurs.

Let us tabulate the scores for each of the individual extensions that was made for the three alignments in a score matrix. For the first alignment candidate, an element was taken from the first sequence, but it was matched to a gap. This earns a penalty of -1.

- G
0, ⏹️
A -1, ⬆️

We start at the top left cell, representing the empty alignment, and assign it a value of 0. Since one element (A) was taken from the first sequence, we move one row down to the row labelled A. No element was taken from the second sequence, so we remain in the same column. Finally the score of -1 is assigned, because the extension includes a gap.

At the same time, we also populate a traceback value that tracks the cell representing the alignment that it extends. A value of ⬆️ means that the extension is based on the cell above it, etc. This will come in handy when we need to determine the final optimal alignment.

Next, an element was taken from the second sequence (incrementing the column index) and a gap was taken from the first (leaving the row index the same). Once again, the extension involves a gap, earning a penalty of -1. However, the score becomes -2 because we build on the -1 from the cell to the left.

- G
0, ⏹️
A -1, ⬆️ -2, ⬅️

Repeating the same process for the second alignment candidate similarly yields:

- G
0, ⏹️ -1, ⬅️
A -2, ⬆️

Finally the third alignment involved only a single extension, which simultaneously aligned elements from both sequences. We therefore move diagonally from the top left cell and assign a value of -1, which we get as a result of invoking similarity_value("A", "G"):

- G
0, ⏹️
A -1, ↖️

We now have three candidates for the bottom right cell: (-2, ⬅️), (-2, ⬆️), and (-1, ↖️). We pick (-1, ↖️) since it has the largest value, yielding the combined table:

- G
0, ⏹️ -1, ⬅️
A -1, ⬆️ -1, ↖️

The scoring algorithm above can be generalized to sequences of arbitrary length. The score and traceback can be calculated for any cell in row i and column j with the following function. It returns a tuple contain the score and the traceback value:

  # i: the row index (corresponding to the first sequence)
  # j: the column index (corresponding to the second sequence)
  # s1_i: the element of sequence one at index i
  # s2_j: the element of sequence two at index j
  # acc: the score/traceback matrix generated so far
  defp score(i, s1_i, j, s2_j, acc)

  # Row 0 and column 0, the empty alignment.
  # We assign the score 0, and populate the traceback with `done`
  defp score(0 = i, s1_i, 0 = j, s2_j, acc) do
    {0, :done}
  end

  # Row 0, gap extension in sequence one
  # The number of gaps inserted so far equals j
  # The cell can only be reached from the `left`
  defp score(0 = i, s1_i, j, s2_j, acc) do
    {j * -1, :left}
  end

  # Column 0, gap extension in sequence two
  # The number of gaps inserted so far equals i
  # The cell can only be reached from the cell above: `up`
  defp score(i, s1_i, 0 = j, s2_j, acc) do
    {i * -1, :up}
  end

  # With the above special cases out of the way,
  # any cell can be reached by:
  #   inserting a gap in sequence one,
  #   inserting a gap in sequence two,
  #   or by taking elements from both sequences
  defp score(i, s1_i, j, s2_j, acc) do
    # We lookup the values to the top left, to the top, and to the left
    {diag, _} = acc[{i - 1, j - 1}]
    {up, _} = acc[{i - 1, j}]
    {left, _} = acc[{i, j - 1}]

    # We apply the possible extensions, calculate a new score
    # and keep only the best option
    [
      {diag + substitution_value(s1_i, s2_j), :diag},
      {up - 1, :up},
      {left - 1, :left}
    ]
    |> Enum.max_by(&elem(&1, 0))
  end

It this point a note on the underlying data structure that stores the matrix is warranted. It may be tempting to represent the matrix as a list of lists. But the matrix needs to be populated from left to right and top to bottom, meaning that we'd need to append new values to the lists. Linked lists are more appropriate when *pre*pending on extension.

list_based_matrix =
  [
    [{0, :done, {-1, :left}],
    [{-1, :up}, {-1, :diag}]
  ]

Another option is to use a tuple of tuples. But tuples have to be copied when changes are made to them. So it is best to use tuples when you know what the elements are up front.

tuple_based_matrix =
  {
    {{0, :done, {-1, :left}},
    {{-1, :up}, {-1, :diag}}
  }

So we use a map instead, which maps a tuple of coordinates to the appropriate score and traceback tuple:

map_based_matrix =
  %{
    {0, 0} => {0, :done},
    {0, 1} => {-1, :left},
    {1, 0} => {-1, :up},
    {1, 1} => {-1, :diag}
  }

Maps, like all data structures in Elixir are immutable, but similar to linked lists, maps don't have to be copied in full when a key value pair is added, but rather rely on pointers to the parent structure. Moreover, values can be lookup up in constant time.

Using the map data structure above, the score and traceback matrix can be populated with the following function:

defp score_matrix(s1, s2) do
    s1_with_index = Enum.with_index(["" | s1])
    s2_with_index = Enum.with_index(["" | s2])

    # Note the `reduce` option! It allows you to
    # use `for` like you would `Enum.reduce/3`
    for {s1_i, i} <- s1_with_index,
        {s2_j, j} <- s2_with_index,
        reduce: %{} do
      acc -> Map.put(acc, {i, j}, score(i, s1_i, j, s2_j, acc))
    end
  end

Now that the matrix is fully populated, we start at the bottom right cell (which per definition contains the score for the best alignment) and recursively follow the directives of the traceback all the way to the top left cell. And we construct the alignment in reverse in the process.

def traceback(score_matrix, s1, s2) do
  # Start at the bottom right cell
  i = Enum.count(s1)
  j = Enum.count(s2)
  {score, match} = Map.get(score_matrix, {i, j})

  {s1_aligned, pipes, s2_aligned} =
    do_traceback(score_matrix, s1, s2, i, j, match, [], [], [])

  {score, s1_aligned, pipes, s2_aligned}
end

# :done indicates that the alignment is done, we just return it
# This is the base case of the recursion.
defp do_traceback(score_matrix, s1, s2, i, j, :done, s1_aligned, pipes, s2_aligned) do
  {s1_aligned, pipes, s2_aligned}
end

# :up indicates that a gap was inserted in the second sequence
defp do_traceback(score_matrix, s1, s2, i, j, :up, s1_aligned, pipes, s2_aligned) do
  s1_aligned = [Enum.at(s1, i - 1) | s1_aligned]
  pipes = [" " | pipes]
  s2_aligned = ["-" | s2_aligned] # here we prepend a gap

  # We now move to the cell right above this one
  {_, match} = score_matrix[{i - 1, j}]
  do_traceback(score_matrix, s1, s2, i - 1, j, match, s1_aligned, pipes, s2_aligned)
end

# :left indicates that a gap was inserted in the first sequence
defp do_traceback(score_matrix, s1, s2, i, j, :left, s1_aligned, pipes, s2_aligned) do
  s1_aligned = ["-" | s1_aligned] # here we prepend a gap
  pipes = [" " | pipes]
  s2_aligned = [Enum.at(s2, j - 1) | s2_aligned]

  # We now move to the cell left of this one
  {_, match} = score_matrix[{i, j - 1}]
  do_traceback(score_matrix, s1, s2, i, j - 1, match, s1_aligned, pipes, s2_aligned)
end

# :diag indicates that the alignment should be extended with elements
# from both sequences
defp do_traceback(score_matrix, s1, s2, i, j, :diag, s1_aligned, pipes, s2_aligned) do
  s1_aligned = [Enum.at(s1, i - 1) | s1_aligned]
  pipes = ["|" | pipes]
  s2_aligned = [Enum.at(s2, j - 1) | s2_aligned]

  # We now move to the cell to the top left of this one
  {_, match} = score_matrix[{i - 1, j - 1}]
  do_traceback(score_matrix, s1, s2, i - 1, j - 1, match, s1_aligned, pipes, s2_aligned)
end

And finally, we can align two sequences with the following function:

def align(s1, s2) when is_list(s1) and is_list(s2) do
  score_matrix = score_matrix(s1, s2)
  traceback(score_matrix, s1, s2)
end

And the output can be pretty printed with:

def format({score, s1_aligned, pipes, s2_aligned}) do
  IO.puts(score)
  IO.puts(Enum.join(s1_aligned))
  IO.puts(Enum.join(pipes))
  IO.puts(Enum.join(s2_aligned))
end
iex(1)> alignment = Belign.NeedlemanWunsch.align("GATTACA", "GCATGCU")
{0, ["G", "-", "A", "T", "T", "A", "C", "A"],
 ["|", " ", "|", " ", "|", "|", "|", "|"],
 ["G", "C", "A", "-", "T", "G", "C", "U"]}
iex(2)> Belign.NeedlemanWunsch.format(alignment)
0
G-ATTACA
| | ||||
GCA-TGCU
:ok

Here is the code in full:

defmodule Belign.NeedlemanWunsch do
  def align(s1, s2) when is_binary(s1) and is_binary(s2) do
    align(String.split(s1, "", trim: true), String.split(s2, "", trim: true))
  end

  def align(s1, s2) when is_list(s1) and is_list(s2) do
    score_matrix = score_matrix(s1, s2)
    traceback(score_matrix, s1, s2)
  end

  defp score_matrix(s1, s2) do
    s1_with_index = Enum.with_index(["" | s1])
    s2_with_index = Enum.with_index(["" | s2])

    for {s1_i, i} <- s1_with_index, {s2_j, j} <- s2_with_index, reduce: %{} do
      acc -> Map.put(acc, {i, j}, score(i, s1_i, j, s2_j, acc))
    end
  end

  defp score(0 = _i, _s1_i, 0 = _j, _s2_j, _acc) do
    {0, :done}
  end

  defp score(0 = _i, _s1_i, j, _s2_j, _acc) do
    {j * -1, :left}
  end

  defp score(i, _s1_i, 0 = _j, _s2_j, _acc) do
    {i * -1, :up}
  end

  defp score(i, s1_i, j, s2_j, acc) do
    {diag, _} = acc[{i - 1, j - 1}]
    {up, _} = acc[{i - 1, j}]
    {left, _} = acc[{i, j - 1}]

    # NOTE that we only consider the first max. So in order to
    # favour match/mismatch over indels, we put it higher up in the'
    # list.
    [
      {diag + substitution_value(s1_i, s2_j), :diag},
      {up - 1, :up},
      {left - 1, :left}
    ]
    |> Enum.max_by(&elem(&1, 0))
  end

  defp substitution_value(x, x), do: 1
  defp substitution_value(_, _), do: -1

  def traceback(score_matrix, s1, s2) do
    # Start at the bottom right cell
    i = Enum.count(s1)
    j = Enum.count(s2)
    {score, match} = Map.get(score_matrix, {i, j})

    {s1_aligned, pipes, s2_aligned} =
      do_traceback(score_matrix, s1, s2, i, j, match, [], [], [])

    {score, s1_aligned, pipes, s2_aligned}
  end

  defp do_traceback(_score_matrix, _s1, _s2, _i, _j, :done, s1_aligned, pipes, s2_aligned) do
    {s1_aligned, pipes, s2_aligned}
  end

  defp do_traceback(score_matrix, s1, s2, i, j, :up, s1_aligned, pipes, s2_aligned) do
    s1_aligned = [Enum.at(s1, i - 1) | s1_aligned]
    pipes = [" " | pipes]
    s2_aligned = ["-" | s2_aligned]

    {_, match} = score_matrix[{i - 1, j}]
    do_traceback(score_matrix, s1, s2, i - 1, j, match, s1_aligned, pipes, s2_aligned)
  end

  defp do_traceback(score_matrix, s1, s2, i, j, :left, s1_aligned, pipes, s2_aligned) do
    s1_aligned = ["-" | s1_aligned]
    pipes = [" " | pipes]
    s2_aligned = [Enum.at(s2, j - 1) | s2_aligned]

    {_, match} = score_matrix[{i, j - 1}]
    do_traceback(score_matrix, s1, s2, i, j - 1, match, s1_aligned, pipes, s2_aligned)
  end

  defp do_traceback(score_matrix, s1, s2, i, j, :diag, s1_aligned, pipes, s2_aligned) do
    s1_aligned = [Enum.at(s1, i - 1) | s1_aligned]
    pipes = ["|" | pipes]
    s2_aligned = [Enum.at(s2, j - 1) | s2_aligned]

    {_, match} = score_matrix[{i - 1, j - 1}]
    do_traceback(score_matrix, s1, s2, i - 1, j - 1, match, s1_aligned, pipes, s2_aligned)
  end

  def format({score, s1_aligned, pipes, s2_aligned}) do
    IO.puts(score)
    IO.puts(Enum.join(s1_aligned))
    IO.puts(Enum.join(pipes))
    IO.puts(Enum.join(s2_aligned))
  end
end

In closing, just as there are many diff algorithms in computer science, there are many sequence alignment algorithms and parameterizations thereof in biology. We will be using sequence alignment to track how our artificial life form evolves. But also as a mechanism to determine binding affinities between molecules. The latter use is an idea from the stringmol artificial chemistry system.

Top comments (0)