DEV Community

Cover image for Solving the Protein Translator problem using F# and FParsec
Kirk Shillingford
Kirk Shillingford

Posted on

Solving the Protein Translator problem using F# and FParsec

This article/workbook is my attempt to solve the Protein Translation Problem on exercism.io. The whole thing was done in a .Net Interactive Workbook.

Last week, I tried this problem using Haskell (you can see the solution here) as an exploration of point free style and composition.

So let's now attempt this problem another way, using FParsec, a fully functional parser combinator library.

Without getting too much into the details, Parser Combinator libraries operate on the principle of building parsers for large, complex, or intricate data structure, by combining smaller parsers together.

So let's jump in.

Describing the problem space

(Taken from exercism):

RNA can be broken into three nucleotide sequences called codons, and then translated to a polypeptide like so:

RNA: "AUGUUUUCU" => translates to

Codons: "AUG", "UUU", "UCU" => which become a polypeptide with the following sequence =>

Protein: "Methionine", "Phenylalanine", "Serine"

There are also three terminating codons (also known as 'STOP' codons); if any of these codons are encountered (by the ribosome), all translation ends and the protein is terminated.

All subsequent codons after are ignored, like this:

RNA: "AUGUUUUCUUAAAUG" =>

Codons: "AUG", "UUU", "UCU", "UAA", "AUG" =>

Protein: "Methionine", "Phenylalanine", "Serine"

Note the stop codon "UAA" terminates the translation and the final methionine is not translated into the protein sequence.

Below are the codons and resulting Amino Acids needed for the exercise.

  • AUG -> Methionine
  • UUU, UUC -> Phenylalanine
  • UUA, UUG -> Leucine
  • UCU, UCC, UCA, UCG -> Serine
  • UAU, UAC -> Tyrosine
  • UGU, UGC -> Cysteine
  • UGG -> Tryptophan
  • UAA, UAG, UGA -> STOP

Defining our goal

To call this solution solved, our requirements are as follows:

  • Define a function proteins that accepts a string, and returns an Option type which either produces a List of Strings representing our proteins or None. This is the original goal of the exercise.
  • An additional requirement that our solution stop consuming inputs once the appropriate output has been achieved.

Importing the relevant libraries

It turns out we only need to import the FParsec library. Everything else we need is a native function.

#r "nuget: FParsec"
open FParsec
Enter fullscreen mode Exit fullscreen mode

Defining our domain

Let's create some types to represent our domain space. In this case, we just need presentation of our codons and our proteins.

We're going to represent our codons as two separate types, ProteinCodon, which can be translated into protein amino acids, and StopCodon, which represent STOP.

type ProteinCodon = AUG | UUU | UUC | UUA | UUG | UCU | UCC | UCA | UCG | UAU | UAC | UGU | UGC | UGG 

type StopCodon = UAA | UAG | UGA

type Protein = Methionine | Phenylalanine | Leucine | Serine | Tyrosine | Cysteine | Tryptophan
Enter fullscreen mode Exit fullscreen mode

Defining our helper functions

We need a few functions to help us out here. toProtein takes a ProteinCodon and returns the corresponding Protein.

let toProtein =
  function
    | AUG -> Methionine
    | UUU -> Phenylalanine
    | UUC -> Phenylalanine
    | UUA -> Leucine
    | UUG -> Leucine
    | UCU -> Serine
    | UCC -> Serine
    | UCA -> Serine
    | UCG -> Serine
    | UAU -> Tyrosine
    | UAC -> Tyrosine
    | UGU -> Cysteine
    | UGC -> Cysteine
    | UGG -> Tryptophan
Enter fullscreen mode Exit fullscreen mode

We can test that running this function works

toProtein UGG

Out[1]:
Tryptophan
Enter fullscreen mode Exit fullscreen mode

Defining our basic parsers

The most complex part of this is our generic pUnion parser. Without getting too much into the details, we use the FParsec functions stringReturn and choice.

stringReturn accepts a string to parse against and a type to return and makes a Parser of that type.

choice accepts a sequence of many parsers of the same type and returns a Parser of that type.

We define a helper function pCase that takes a union type value and makes a parser for that value specifically, and then we map that function across all cases in a union for full coverage.

Putting it all together, we've defined a parametrized parser that accepts a sum type like one of the codons and makes a parser that will compare a string to see if it's representations.

open FSharp.Reflection

let pUnion<'t> : Parser<'t, unit> =
    let pCase (case:UnionCaseInfo) =
        stringReturn case.Name (FSharpValue.MakeUnion(case, [||]) :?> 't)

    FSharpType.GetUnionCases typeof<'t> |> Seq.map pCase |> choice
Enter fullscreen mode Exit fullscreen mode

Thus we can say pUnion<ProteinCodon> and get a parser of type Parser<ProteinCodon, unit>, or pUnion<Protein> to get a parser of type Parser<Protein, unit>

Composing our parsers

Now we can start combining some parsers together to get what we want.

let pProteinString = pUnion<ProteinCodon> |>> toProtein |>> fun p -> p.ToString()
Enter fullscreen mode Exit fullscreen mode

pProteinString above starts with the parser for protein codons pUnion<ProteinCocon> and then uses the |>> operator to map the parser results to toProtein and that map those value to the ToString method. The result is a Parser capable of checking if a string matches a ProteinCodon, but returns a string representing a Protein. It's full return type is Parser<string, unit>.

Let's see what it tells us with a valid input. The run function is what we use to actually execute our parsers on some data.

run pProteinString "AUG"

Out[1]:
"Methionine"
Enter fullscreen mode Exit fullscreen mode

Great! That did exactly what we wanted. It parsed the string representing a codon, but returned the value of the codon converted to a protein, converted to a string. Let's now try an invalid input. Something that shouldn't be parsable into a protein.

run pProteinString "UGA"

Out[1]:
Error in Ln: 1 Col: 1 UGA ^ Expecting: 'AUG', 'UAC', 'UAU', 'UCA', 'UCC', 'UCG', 'UCU', 'UGC', 'UGG', 'UGU' , 'UUA', 'UUC', 'UUG' or 'UUU'
Enter fullscreen mode Exit fullscreen mode

Error! or well, successful error? Our parser successfully rejected the StopCodon string we passed in, and told us it was looking for a string representing a ProteinCodon!

Let's make some more parsers.

let pStopCodon = pUnion<StopCodon> >>% ()

let pStopCodonOrEOL = pStopCodon <|> eof
Enter fullscreen mode Exit fullscreen mode

Next, we make the parser for our StopCodon, pStopCodon. Note that pStopCodon parses a StopCodon, but then uses the FParsec operator >>% to ignore that value and return a unit instead. This is so we can combine it with the eol (end of line) parser in pStopOrEol.

<|> is another FParsec operator which takes two parsers and returns a parser that can be either one, so long as they're the same type. Since eol returns a unit () we needed to get stop to do the same.

run pStopCodonOrEOL "UGA" //valid StopCodon input

Out[1]:
<null>

run pStopCodonOrEOL "" // valid "end of line" input

Out[1]:
<null>

run pStopCodonOrEOL "xxx" //invalid input

Out[1]:
Error in Ln: 1 Col: 1 xxx ^ Expecting: end of input, 'UAA', 'UAG' or 'UGA'
Enter fullscreen mode Exit fullscreen mode

Success! Our parser successfully accepts a string that represents a StopCodon or the end of a string but rejects anything else. Note that our "unit" type output from the successful parsers is returned as <null>.

let pProteinStringList = manyTill pProteinString pStopCodonOrEOL
Enter fullscreen mode Exit fullscreen mode

Finally we put it all together in the pProteinStringList parser. Here we use the FParsec function manyTill.

manyTill is a function that takes two parsers; it will continue to collect values of the first until it hits a value of the second, and then returns the values from the first as a list.

This is conveniently, exactly what we need for this challenge!

Running our solution

let proteins s =
    match run pProteinStringList s with
    | Success (result, _, _) -> Some result
    | Failure _ -> None
Enter fullscreen mode Exit fullscreen mode

Now we can finally put it all together with our proteins functions which accepts a string and uses the FParsec function run which is what actually applies our parsers to some string.

run returns a ParserResult which is a Success if the input was parsed without error and a Failure with error messages if it doesn't. Here we're choosing to ignore all the extra meta data and just convert these to an option> type.

proteins "AUG" // returns [ "Methionine" ]

Out[1]:
[ "Methionine" ]

proteins "UGA" // returns []

Out[1]:
[ ]

proteins "AUGUCUUAGAUG" // returns [ "Methionine", "Serine" ]

Out[1]:
[ "Methionine", "Serine" ]
Enter fullscreen mode Exit fullscreen mode

And that's it. It all works. Thank you for reading.

Here's the entire solution:

Without the comments, it's pretty concise! Most of the code is just setting up our domain! And we haven't had to define any loops, lists, maps, or finicky control logic. Our controls are baked into our data types. Yay parsers!

#r "nuget: FParsec"
open FParsec

type ProteinCodon = AUG | UUU | UUC | UUA | UUG | UCU | UCC | UCA | UCG | UAU | UAC | UGU | UGC | UGG 

type StopCodon = UAA | UAG | UGA

type Protein = Methionine | Phenylalanine | Leucine | Serine | Tyrosine | Cysteine | Tryptophan

let toProtein =
  function
    | AUG -> Methionine
    | UUU -> Phenylalanine
    | UUC -> Phenylalanine
    | UUA -> Leucine
    | UUG -> Leucine
    | UCU -> Serine
    | UCC -> Serine
    | UCA -> Serine
    | UCG -> Serine
    | UAU -> Tyrosine
    | UAC -> Tyrosine
    | UGU -> Cysteine
    | UGC -> Cysteine
    | UGG -> Tryptophan

let pUnion<'t> : Parser<'t, unit> =
    let pCase (case:UnionCaseInfo) =
        stringReturn case.Name (FSharpValue.MakeUnion(case, [||]) :?> 't)

    FSharpType.GetUnionCases typeof<'t> |> Seq.map pCase |> choice

let pProteinString = pUnion<ProteinCodon> |>> toProtein |>> fun p -> p.ToString()

let pStopCodon = pUnion<StopCodon> >>% ()

let pStopCodonOrEOL = pStopCodon <|> eof

let pProteinStringList = manyTill pProteinString pStopCodonOrEOL

let proteins s =
    match run pProteinStringList s with
    | Success (result, _, _) -> Some result
    | Failure _ -> None
Enter fullscreen mode Exit fullscreen mode

Discussion (0)