This article/workbook is my attempt to solve the Protein Translation Problem on exercism.io. The whole thing was done in a .Net Interactive Workbook.
Last week, I tried this problem using Haskell (you can see the solution here) as an exploration of point free style and composition.
So let's now attempt this problem another way, using FParsec, a fully functional parser combinator library.
Without getting too much into the details, Parser Combinator libraries operate on the principle of building parsers for large, complex, or intricate data structure, by combining smaller parsers together.
So let's jump in.
Describing the problem space
(Taken from exercism):
RNA can be broken into three nucleotide sequences called codons, and then translated to a polypeptide like so:
RNA: "AUGUUUUCU" => translates to
Codons: "AUG", "UUU", "UCU" => which become a polypeptide with the following sequence =>
Protein: "Methionine", "Phenylalanine", "Serine"
There are also three terminating codons (also known as 'STOP' codons); if any of these codons are encountered (by the ribosome), all translation ends and the protein is terminated.
All subsequent codons after are ignored, like this:
RNA: "AUGUUUUCUUAAAUG" =>
Codons: "AUG", "UUU", "UCU", "UAA", "AUG" =>
Protein: "Methionine", "Phenylalanine", "Serine"
Note the stop codon "UAA" terminates the translation and the final methionine is not translated into the protein sequence.
Below are the codons and resulting Amino Acids needed for the exercise.
- AUG -> Methionine
- UUU, UUC -> Phenylalanine
- UUA, UUG -> Leucine
- UCU, UCC, UCA, UCG -> Serine
- UAU, UAC -> Tyrosine
- UGU, UGC -> Cysteine
- UGG -> Tryptophan
- UAA, UAG, UGA -> STOP
Defining our goal
To call this solution solved, our requirements are as follows:
- Define a function proteins that accepts a
string
, and returns an Option type which either produces aList
of Strings representing our proteins orNone
. This is the original goal of the exercise. - An additional requirement that our solution stop consuming inputs once the appropriate output has been achieved.
Importing the relevant libraries
It turns out we only need to import the FParsec
library. Everything else we need is a native function.
#r "nuget: FParsec"
open FParsec
Defining our domain
Let's create some types to represent our domain space. In this case, we just need presentation of our codons and our proteins.
We're going to represent our codons as two separate types, ProteinCodon
, which can be translated into protein amino acids, and StopCodon
, which represent STOP
.
type ProteinCodon = AUG | UUU | UUC | UUA | UUG | UCU | UCC | UCA | UCG | UAU | UAC | UGU | UGC | UGG
type StopCodon = UAA | UAG | UGA
type Protein = Methionine | Phenylalanine | Leucine | Serine | Tyrosine | Cysteine | Tryptophan
Defining our helper functions
We need a few functions to help us out here. toProtein
takes a ProteinCodon
and returns the corresponding Protein
.
let toProtein =
function
| AUG -> Methionine
| UUU -> Phenylalanine
| UUC -> Phenylalanine
| UUA -> Leucine
| UUG -> Leucine
| UCU -> Serine
| UCC -> Serine
| UCA -> Serine
| UCG -> Serine
| UAU -> Tyrosine
| UAC -> Tyrosine
| UGU -> Cysteine
| UGC -> Cysteine
| UGG -> Tryptophan
We can test that running this function works
toProtein UGG
Out[1]:
Tryptophan
Defining our basic parsers
The most complex part of this is our generic pUnion parser. Without getting too much into the details, we use the FParsec functions stringReturn
and choice
.
stringReturn accepts a string to parse against and a type to return and makes a Parser of that type.
choice accepts a sequence of many parsers of the same type and returns a Parser of that type.
We define a helper function pCase
that takes a union type value and makes a parser for that value specifically, and then we map that function across all cases in a union for full coverage.
Putting it all together, we've defined a parametrized parser that accepts a sum type like one of the codons and makes a parser that will compare a string to see if it's representations.
open FSharp.Reflection
let pUnion<'t> : Parser<'t, unit> =
let pCase (case:UnionCaseInfo) =
stringReturn case.Name (FSharpValue.MakeUnion(case, [||]) :?> 't)
FSharpType.GetUnionCases typeof<'t> |> Seq.map pCase |> choice
Thus we can say pUnion<ProteinCodon>
and get a parser of type Parser<ProteinCodon, unit>
, or pUnion<Protein>
to get a parser of type Parser<Protein, unit>
Composing our parsers
Now we can start combining some parsers together to get what we want.
let pProteinString = pUnion<ProteinCodon> |>> toProtein |>> fun p -> p.ToString()
pProteinString
above starts with the parser for protein codons pUnion<ProteinCocon>
and then uses the |>>
operator to map the parser results to toProtein
and that map those value to the ToString
method. The result is a Parser capable of checking if a string matches a ProteinCodon
, but returns a string representing a Protein
. It's full return type is Parser<string, unit>
.
Let's see what it tells us with a valid input. The run function is what we use to actually execute our parsers on some data.
run pProteinString "AUG"
Out[1]:
"Methionine"
Great! That did exactly what we wanted. It parsed the string representing a codon, but returned the value of the codon converted to a protein, converted to a string. Let's now try an invalid input. Something that shouldn't be parsable into a protein.
run pProteinString "UGA"
Out[1]:
Error in Ln: 1 Col: 1 UGA ^ Expecting: 'AUG', 'UAC', 'UAU', 'UCA', 'UCC', 'UCG', 'UCU', 'UGC', 'UGG', 'UGU' , 'UUA', 'UUC', 'UUG' or 'UUU'
Error! or well, successful error? Our parser successfully rejected the StopCodon
string we passed in, and told us it was looking for a string representing a ProteinCodon
!
Let's make some more parsers.
let pStopCodon = pUnion<StopCodon> >>% ()
let pStopCodonOrEOL = pStopCodon <|> eof
Next, we make the parser for our StopCodon
, pStopCodon
. Note that pStopCodon
parses a StopCodon
, but then uses the FParsec operator >>%
to ignore that value and return a unit
instead. This is so we can combine it with the eol
(end of line) parser in pStopOrEol
.
<|>
is another FParsec
operator which takes two parsers and returns a parser that can be either one, so long as they're the same type. Since eol
returns a unit () we needed to get stop to do the same.
run pStopCodonOrEOL "UGA" //valid StopCodon input
Out[1]:
<null>
run pStopCodonOrEOL "" // valid "end of line" input
Out[1]:
<null>
run pStopCodonOrEOL "xxx" //invalid input
Out[1]:
Error in Ln: 1 Col: 1 xxx ^ Expecting: end of input, 'UAA', 'UAG' or 'UGA'
Success! Our parser successfully accepts a string that represents a StopCodon
or the end of a string but rejects anything else. Note that our "unit" type output from the successful parsers is returned as <null>
.
let pProteinStringList = manyTill pProteinString pStopCodonOrEOL
Finally we put it all together in the pProteinStringList
parser. Here we use the FParsec
function manyTill
.
manyTill
is a function that takes two parsers; it will continue to collect values of the first until it hits a value of the second, and then returns the values from the first as a list.
This is conveniently, exactly what we need for this challenge!
Running our solution
let proteins s =
match run pProteinStringList s with
| Success (result, _, _) -> Some result
| Failure _ -> None
Now we can finally put it all together with our proteins functions which accepts a string and uses the FParsec
function run
which is what actually applies our parsers to some string.
run
returns a ParserResult
which is a Success
if the input was parsed without error and a Failure
with error messages if it doesn't. Here we're choosing to ignore all the extra meta data and just convert these to an option> type.
proteins "AUG" // returns [ "Methionine" ]
Out[1]:
[ "Methionine" ]
proteins "UGA" // returns []
Out[1]:
[ ]
proteins "AUGUCUUAGAUG" // returns [ "Methionine", "Serine" ]
Out[1]:
[ "Methionine", "Serine" ]
And that's it. It all works. Thank you for reading.
Here's the entire solution:
Without the comments, it's pretty concise! Most of the code is just setting up our domain! And we haven't had to define any loops, lists, maps, or finicky control logic. Our controls are baked into our data types. Yay parsers!
#r "nuget: FParsec"
open FParsec
type ProteinCodon = AUG | UUU | UUC | UUA | UUG | UCU | UCC | UCA | UCG | UAU | UAC | UGU | UGC | UGG
type StopCodon = UAA | UAG | UGA
type Protein = Methionine | Phenylalanine | Leucine | Serine | Tyrosine | Cysteine | Tryptophan
let toProtein =
function
| AUG -> Methionine
| UUU -> Phenylalanine
| UUC -> Phenylalanine
| UUA -> Leucine
| UUG -> Leucine
| UCU -> Serine
| UCC -> Serine
| UCA -> Serine
| UCG -> Serine
| UAU -> Tyrosine
| UAC -> Tyrosine
| UGU -> Cysteine
| UGC -> Cysteine
| UGG -> Tryptophan
let pUnion<'t> : Parser<'t, unit> =
let pCase (case:UnionCaseInfo) =
stringReturn case.Name (FSharpValue.MakeUnion(case, [||]) :?> 't)
FSharpType.GetUnionCases typeof<'t> |> Seq.map pCase |> choice
let pProteinString = pUnion<ProteinCodon> |>> toProtein |>> fun p -> p.ToString()
let pStopCodon = pUnion<StopCodon> >>% ()
let pStopCodonOrEOL = pStopCodon <|> eof
let pProteinStringList = manyTill pProteinString pStopCodonOrEOL
let proteins s =
match run pProteinStringList s with
| Success (result, _, _) -> Some result
| Failure _ -> None
Top comments (0)