
Proteins as programs - a Joy parser in Elixir

If we consider all the molecules in a cell, and particularly those involved in metabolism, and we connect them all in a graph with links between molecules that can be interconverted, we end up with an enormous structure representing all the possible chemical reactions in the cell. While many of these reactions are thermodynamically favourable, we see only a fraction of these occurring at significant rates in living systems.

That is rather fortunate, because any living system that races uncontrollably towards equilibrium races towards death. Most reactions, even thermodynamically favourable ones, need a bit of a nudge to get going. This is where enzymes come in. Enzymes are proteins that act as catalysts that lower the activation energy of the biological processes in cells. Most enzymes catalyse only very specific reactions and are themselves regulated very precisely, elevating just the right reactions at the right time above the insignificantly slow background reaction noise. In this way, cells maintain significant reaction fluxes in a state far from equilibrium.

No living system, whether real or artificial, can function without enzymes. Being the actors in cells, they constitute a significant part of the "efficient" in "closure to efficient causation", a fundamental property of life. In this third post in the series, we explore some similarities between proteins and programs.

Code

Conventional math notation is a mixture of infix and prefix notation. Common operators like + and - go between the arguments on which they operate, as in 2 + 2, whereas symbols representing functions typically precede their arguments, like the a in a(2, 2). While this works, and we are all used to it, it has some shortcomings. The order in which symbols are written is often not the order in which they need to be applied. We rely on arbitrary rules of operator precedence to remember that multiplication trumps addition. Similarly, when two functions are composed we write f(g(x)), but then apply g before f.

Reverse Polish notation (RPN) is an alternative to infix and prefix (Polish) notation in which operators and functions alike follow their arguments, as in 2 2 + or 2 2 a. There is no need to rely on operator precedence or parentheses, because operators and functions are simply applied in the order in which they appear from left to right. Take, for example, the expression 4 2 2 + *, which evaluates to 16. The same holds for function composition: x g f (equivalent to the f(g(x)) above) clearly shows that g is applied before f.

If this feels wrong, it is probably because we have been trained from an early age to use infix notation. But perhaps there is also something fundamentally hard about RPN for humans. Humans are not good at keeping things in working memory, but in RPN you often have to keep values in memory across several steps. Consider, for instance, the 4 in 4 2 2 + *. It needs to be kept in memory until right at the end, when it is used in the final multiplication step. The memory problem encountered in this particular case can of course be overcome by considering the equivalent expression 2 2 + 4 * instead.

Computers, on the other hand, don't have the short-term memory issues that we do. While operators are always applied immediately in RPN, values sometimes have to be stored and retrieved in a last-in-first-out scheme. Stacks, being last-in-first-out data structures, are therefore ideal for the evaluation of expressions written in RPN.

Let's evaluate 4 2 2 + * with a stack. We first encounter 4, a normal integer, and put it on the stack.

4

We then encounter two 2s, which we also put on the stack in the same way:

2
2
4

Now, we encounter + which is a binary operator (taking two arguments). We pop two items off the top of the stack, apply the addition step on them, and push the result (4) to the stack:

4
4

Finally, we encounter *, which is binary as well, and therefore pop two items off the top of the stack, apply the multiplication step on them, and push the result 16 to the top of the stack:

16

There is nothing left of the expression and so we pop the item at the top of the stack and take it to be the result.
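
This procedure translates almost directly into code. Here is a minimal sketch of such an evaluator in Elixir (my own illustration, not part of any Joy tooling; the RPN module name and its limited operator set are assumptions):

defmodule RPN do
  # Split the expression into tokens and fold them over an empty stack.
  def eval(expression) do
    expression
    |> String.split()
    |> Enum.reduce([], &step/2)
    |> hd()
  end

  # Operators pop two values off the stack and push the result.
  defp step("+", [a, b | rest]), do: [b + a | rest]
  defp step("*", [a, b | rest]), do: [b * a | rest]
  # Any other token is taken to be an integer that pushes itself.
  defp step(token, stack), do: [String.to_integer(token) | stack]
end

RPN.eval("4 2 2 + *")
# => 16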

The majority of programming languages also make use of infix and prefix notation. And that is probably for the best. But code snippets sometimes reveal an underlying desire to write things in the order in which they need to be applied.

Take, for instance, the function invocation c(b(a(x))). Developers often start at the a(x) part and then build up the rest in reverse, reverting to temporary variables:

a_result = a(x)
b_result = b(a_result)
c_result = c(b_result)

Elixir is no exception in this regard, but being averse to pointless variable assignment, it introduced the pipe operator (|>), which almost works like the bash pipe (|). Using the pipe operator, the above snippet can be written as result = x |> a |> b |> c, which is as RPN as it gets.

Some languages, the concatenative languages, lean towards the opposite extreme and allow only RPN; some permit prefix and infix notation in exceptional cases. Examples include Forth, Cat, Factor, Kitten, and my personal favourite, Joy.

Joy was released by Manfred von Thun in 2001. While Joy has a sizeable standard library, making it useful as a regular programming language, it was born out of von Thun's desire, as a computer scientist, to develop a language whose syntactic features make it highly amenable to algebraic analysis. Lambda expressions and formal parameters are some of the things you won't find in Joy. A Joy program is a series of concatenated functions. In Joy, concatenation in the syntactic domain maps onto function composition in the semantic domain.

Let's see how we can convert a list of integers to the corresponding list of squares in Joy:

[2 3 4] [dup *] map

The first item that we encounter is the list [2 3 4]. But really, this is a function of which the result is a stack with [2 3 4] at the top:

[2 3 4]

The next item is a quotation [dup *]. Quotations are plain old lists, except that they may also contain functions. But quotations are also functions themselves, having the effect of putting themselves on the top of the stack:

[dup *]
[2 3 4]

Finally, we encounter the function map, which expects a quotation at the top of the stack, and a list underneath that. It takes the quotation and applies it to each element in the list underneath it, and finally returns a new list containing the results. Here is the first such application in detail.

After encountering the 2:

2 # `2` is a function that places the value `2` on the stack

Evaluating dup, which duplicates the head of the stack:

2
2

Evaluating *, which multiplies the two top elements of the stack:

4

And hence the overall result is:

[4 9 16]

You will have noticed that primitives like numbers, characters, booleans, and lists, which are expressed in other languages as literals, are all treated in Joy as functions that place themselves on the stack.
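
To make this idea concrete, here is a small Elixir sketch of my own (not the interpreter we will build in the next post) in which every word, including the literal 2, is a function from stack to stack:

# Each word is a function from stack to stack; push.(2) plays the
# role of the Joy literal 2.
push = fn value -> fn stack -> [value | stack] end end
dup = fn [top | rest] -> [top, top | rest] end
mul = fn [a, b | rest] -> [a * b | rest] end

# The Joy program `2 dup *` is then just a composition of these functions:
Enum.reduce([push.(2), dup, mul], [], fn word, stack -> word.(stack) end)
# => [4]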

We have mentioned that Joy has no formal parameters. There is a bit of sleight of hand involved here, though, because all functions take the stack as input, optionally operate on the top few elements, and return the stack as output. This is, as they say, semantics: syntactically there are no function applications or function calls in Joy, and thus there are no formal parameters.

Here are a few more Joy functions or combinators: swap, dup, zap, unit, cat, cons, i, dip.

swap swaps the two elements at the top of the stack:

[B] [A] swap == [A] [B]

dup duplicates the top of the stack:

[A] dup == [A] [A]

zap or pop destroys the top of the stack:

[A] zap ==

unit wraps the top of the stack in a quotation:

[A] unit == [[A]]

cat concatenates two quotations at the top of the stack:

[B] [A] cat == [B A]

cons takes the element below the top of the stack and inserts it as the first element of the quotation on top of the stack:

[B] [A] cons == [[B] A]

i interprets or unquotes the quotation at the top of the stack:

[A] i == A

And finally, dip interprets the quotation at the top of the stack and then swaps it with the element below it:

[B] [A] dip == A [B]
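
Most of these combinators are one-liners when translated into Elixir. Here is a sketch under my own representation (the stack as a list with its top at the head, quotations as plain lists; i and dip are omitted because they need an interpreter to execute quotations):

defmodule Combinators do
  def swap([a, b | rest]), do: [b, a | rest]
  def dup([a | rest]), do: [a, a | rest]
  def zap([_a | rest]), do: rest
  def unit([a | rest]), do: [[a] | rest]
  def cat([a, b | rest]), do: [b ++ a | rest]
  def cons([a, b | rest]), do: [[b | a] | rest]
end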

Perhaps surprisingly, these eight combinators are enough to make a Turing-complete subset of Joy. That is, any problem that can be computed at all can be computed with only these combinators, even without any of the numeric or boolean types or strings. All of these can be represented with Church encoding.

But we can do even better. Many of these combinators can be defined in terms of some of the others, indicating that an even smaller number is sufficient for Turing completeness:

cat  == [[i] dip i] cons cons
unit == [] cons
cons == [unit] dip cat
swap == unit dip
dip == swap unit cat i
i == [[]] dip dip zap
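
To see why one of these identities holds, trace swap == unit dip on a stack containing [B] [A]:

[B] [A] unit dip
[B] [[A]] dip    # unit wraps the top element in a quotation
[A] [B]          # dip executes [[A]], pushing [A], then restores [B] on top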

In fact, it can be shown that all combinators can be expressed in terms of only two primitive ones. If this kind of thing interests you, I highly recommend The Theory of Concatenative Combinators by Brent Kerby, from which all of these examples originate.

Biology

One can be excused for drawing parallels between the sequences of nitrogen bases in DNA and amino acids in proteins, on the one hand, and sequences of functions in programs of concatenative languages like Joy, on the other hand. There is a clear syntactic correspondence, but is there also a semantic correspondence?

DNA is largely an inert information storage molecule. Other molecules act on it to realize its grand blueprint, but it does not do any acting itself.

Proteins are the workhorses of cells. They facilitate the structural integrity of cells, movement, uptake of nutrients, excretion of byproducts, regulation of osmotic potential, hormonal and other signals, and many other processes. Most importantly, in the form of enzymes, proteins catalyse biochemical reactions.

At face value then, DNA and some proteins correspond to quoted programs: they serve as "arguments" to programs and are in some cases executed by programs. Enzymes, in turn, correspond to programs, sequences of functions.

There is another correspondence between (Joy) programs and proteins: both possess higher-order structures. Proteins are primarily a linear sequence of amino acids. Stretches of this sequence, however, fold up into helices or sheets, constituting a secondary structure. As a third layer, helices, sheets, and irregular sections fold up into a tertiary structure. Finally, a large number of proteins only function in association with other identical or similar proteins, forming a quaternary structure. Joy programs are primarily a sequence of functions, but some of these elements are (quoted) programs in themselves, giving rise to a higher-order tree structure.

Code

In this section we turn to Elixir, or rather Erlang, in order to build a Joy parser. A parser is a program that turns a sequence of characters into a data structure that captures the underlying syntactic structure of the sequence, often with the purpose of interpreting or executing it as a program in a next step. Strictly speaking, tokenizing is a separate step that precedes parsing, but I will use "parsing" loosely to cover both.

For now, we are only interested in parsing a subset of Joy: one in which there are no numerals, booleans, characters, or strings. The only allowed members are functions and quotations (lists of functions). The parser will be used to convert strings like "dup cat [cons swap] i []" into an Elixir list of atoms like [:dup, :cat, [:cons, :swap], :i, []].

Before we start with the parser itself, it will be useful to discuss formal grammars, and context-free grammars in particular. A formal grammar is a set of production rules of the form lhs -> rhs that specifies how non-terminal symbols can be transformed into other non-terminal or terminal symbols. Non-terminal symbols are symbols that can be subjected to yet more production rules, whereas terminal symbols cannot be transformed any further. A context-free grammar is one in which the lhs can only be a single non-terminal symbol. The start symbol is a special non-terminal symbol that initiates the whole process.

A formal grammar therefore specifies a formal language, the set of all strings that can be produced by starting at the start symbol and systematically and exhaustively applying all production rules until only terminal symbols are left.

The following context-free grammar generates the formal language that contains all valid programs of the subset of Joy that we are interested in. The start symbol is program, the remaining non-terminal symbols are elems, elem, and quotation, and the terminal symbols are [, ], and function. The production rules are:

# Programs can be empty
program ->
# but are usually a set of elements
program -> elems

# But that set can contain a single element
elems -> elem
# or recursively expand into more elements
elems -> elem elems

# An element is either a function
elem -> function
# or a quotation
elem -> quotation

# A quotation is a quoted program
quotation -> [ program ]
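
As a worked example, here is a derivation of the string function [ function ], rewriting the leftmost non-terminal at each step:

program
=> elems
=> elem elems
=> function elems
=> function elem
=> function quotation
=> function [ program ]
=> function [ elems ]
=> function [ elem ]
=> function [ function ]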

There is something missing, of course: we are treating function as a terminal, and so we will end up with programs like function function [function function] function [] as opposed to ones like dup cat [cons swap] i []. To remedy this, we would need to add more production rules like function -> swap and turn function into a non-terminal instead. We refrain from doing so, because there are more such rules than I care to type, and because there is a way that allows us to use a regular expression instead.

Parsing involves two steps: a tokenizing or lexing step, and an actual parsing step. A tokenizer is a program that converts a sequence of characters into a sequence of terminal symbols (as in formal grammar terminal symbols), optionally including some metadata. A parser is a program that takes a sequence of tokens and analyses it for syntactic correctness according to a formal grammar.

These tokenizing and parsing steps are traditionally performed by a duo of programs called lex and yacc, or variants of these developed for particular languages. In Erlang we'll make use of leex and yecc. These programs require us to specify some tokenization rules and a formal grammar using a domain-specific language, which are then compiled to Erlang source code that takes care of the actual lexing and parsing.

Here are our Joy tokenizer rules:

% src/joy_lexer.xrl

Definitions.

function       = [a-z_0-9\-]+
whitespace     = [\s\t\n\r]+

Rules.

{function}        : {token, {function, TokenLine, to_atom(TokenChars)}}.
\[                : {token, {'[',  TokenLine}}.
\]                : {token, {']',  TokenLine}}.
{whitespace}      : skip_token.

Erlang code.

to_atom(Chars) ->
  list_to_atom(Chars).

It starts out with some regular expression definitions for later use. We define what a function looks like and what we consider to be whitespace. Next we specify how sequences of characters, as defined by the regular expressions, should be converted to tokens. Anything that matches our whitespace regex should not be considered a token. The literal brackets "[" and "]" should be treated as the tokens [ and ]. And finally, anything that matches our function regex should be treated as a function token and have its actual value passed along as token metadata. The file ends with a section that contains Erlang code that can be used in any of the rules. In this case, we convert the value of the token to an Erlang/Elixir atom data type.
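
Once compiled (as described below), the generated lexer can be tried on its own. A quick illustration, assuming the module :joy_lexer that leex derives from the file name:

iex> :joy_lexer.string('dup [swap]')
{:ok, [{:function, 1, :dup}, {:"[", 1}, {:function, 1, :swap}, {:"]", 1}], 1}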

And here is our parser grammar:

% src/joy_parser.yrl

Nonterminals program quotation elems elem.
Terminals '[' ']' function.
Rootsymbol program.

program -> '$empty' : [].
program -> elems : '$1'.

elems -> elem : ['$1'].
elems -> elem elems : ['$1' | '$2'].

elem -> function : extract_value('$1').
elem -> quotation : '$1'.

quotation -> '[' program ']' : '$2'.

Erlang code.

extract_value({_Token, _Line, Value}) -> Value.

You will notice that it is essentially the same grammar that we defined earlier, peppered with some Erlang. The Erlang parts are necessary to tell the parser generated by yecc how to convert the parsed tokens to an Erlang data structure. In essence, we tell yecc that programs are lists and that the empty program is an empty list. List members can be functions or quotations. If a list member is a function (which we represent with an atom), the value must be extracted from the token metadata.

Now all we need to do is run mix compile in an Elixir project containing these files. Elixir (which also compiles Erlang, leex, and yecc files by default) will generate corresponding src/joy_lexer.erl and src/joy_parser.erl files, allowing us to define a module like this:

defmodule Joy.Parser do
  def parse(str) do
    # The generated lexer works on charlists, so convert the binary first.
    # On invalid input, the first failing {:error, ...} tuple falls
    # through the `with` and is returned as-is.
    with {:ok, tokens, _} <- str |> to_charlist() |> :joy_lexer.string(),
         {:ok, joy} <- :joy_parser.parse(tokens) do
      joy
    end
  end
end

Strings of Joy can now be parsed to Elixir lists with:

iex(1)> Joy.Parser.parse("dup cat [cons swap] i []")
[:dup, :cat, [:cons, :swap], :i, []]

The article Tokenizing and Parsing in Elixir using leex and yecc by Andrea Leopardi has been particularly useful to me in understanding and using leex and yecc.

In the next post, we will build an interpreter in Elixir that is able to execute such lists of atoms.

To conclude, the individual amino acid residues that make up proteins certainly don't have distinct functions in the way that functions in a Joy program do. Nevertheless, the chemical properties of all the amino acids in a protein conspire with the intracellular conditions to fold up into a functional tertiary structure, which as a whole certainly has parallels to a program or sequence of functions that execute to alter some local or global state.
