Last week, I tried this problem using Haskell (you can see the solution here) as an exploration of point free style and composition.
So let's now attempt this problem another way, using FParsec, a fully functional parser combinator library.
Without getting too much into the details, Parser Combinator libraries operate on the principle of building parsers for large, complex, or intricate data structure, by combining smaller parsers together.
So let's jump in.
(Taken from exercism):
RNA can be broken into three nucleotide sequences called codons, and then translated to a polypeptide like so:
RNA: "AUGUUUUCU" => translates to
Codons: "AUG", "UUU", "UCU" => which become a polypeptide with the following sequence =>
Protein: "Methionine", "Phenylalanine", "Serine"
There are also three terminating codons (also known as 'STOP' codons); if any of these codons are encountered (by the ribosome), all translation ends and the protein is terminated.
All subsequent codons after are ignored, like this:
RNA: "AUGUUUUCUUAAAUG" =>
Codons: "AUG", "UUU", "UCU", "UAA", "AUG" =>
Protein: "Methionine", "Phenylalanine", "Serine"
Note the stop codon "UAA" terminates the translation and the final methionine is not translated into the protein sequence.
Below are the codons and resulting Amino Acids needed for the exercise.
- AUG -> Methionine
- UUU, UUC -> Phenylalanine
- UUA, UUG -> Leucine
- UCU, UCC, UCA, UCG -> Serine
- UAU, UAC -> Tyrosine
- UGU, UGC -> Cysteine
- UGG -> Tryptophan
- UAA, UAG, UGA -> STOP
To call this solution solved, our requirements are as follows:
- Define a function proteins that accepts a
string, and returns an Option type which either produces a
Listof Strings representing our proteins or
None. This is the original goal of the exercise.
- An additional requirement that our solution stop consuming inputs once the appropriate output has been achieved.
It turns out we only need to import the
FParsec library. Everything else we need is a native function.
#r "nuget: FParsec" open FParsec
Let's create some types to represent our domain space. In this case, we just need presentation of our codons and our proteins.
We're going to represent our codons as two separate types,
ProteinCodon, which can be translated into protein amino acids, and
StopCodon, which represent
type ProteinCodon = AUG | UUU | UUC | UUA | UUG | UCU | UCC | UCA | UCG | UAU | UAC | UGU | UGC | UGG type StopCodon = UAA | UAG | UGA type Protein = Methionine | Phenylalanine | Leucine | Serine | Tyrosine | Cysteine | Tryptophan
We need a few functions to help us out here.
toProtein takes a
ProteinCodon and returns the corresponding
let toProtein = function | AUG -> Methionine | UUU -> Phenylalanine | UUC -> Phenylalanine | UUA -> Leucine | UUG -> Leucine | UCU -> Serine | UCC -> Serine | UCA -> Serine | UCG -> Serine | UAU -> Tyrosine | UAC -> Tyrosine | UGU -> Cysteine | UGC -> Cysteine | UGG -> Tryptophan
We can test that running this function works
toProtein UGG Out: Tryptophan
The most complex part of this is our generic pUnion parser. Without getting too much into the details, we use the FParsec functions
stringReturn accepts a string to parse against and a type to return and makes a Parser of that type.
choice accepts a sequence of many parsers of the same type and returns a Parser of that type.
We define a helper function
pCase that takes a union type value and makes a parser for that value specifically, and then we map that function across all cases in a union for full coverage.
Putting it all together, we've defined a parametrized parser that accepts a sum type like one of the codons and makes a parser that will compare a string to see if it's representations.
open FSharp.Reflection let pUnion<'t> : Parser<'t, unit> = let pCase (case:UnionCaseInfo) = stringReturn case.Name (FSharpValue.MakeUnion(case, [||]) :?> 't) FSharpType.GetUnionCases typeof<'t> |> Seq.map pCase |> choice
Thus we can say
pUnion<ProteinCodon> and get a parser of type
Parser<ProteinCodon, unit>, or
pUnion<Protein> to get a parser of type
Now we can start combining some parsers together to get what we want.
let pProteinString = pUnion<ProteinCodon> |>> toProtein |>> fun p -> p.ToString()
pProteinString above starts with the parser for protein codons
pUnion<ProteinCocon> and then uses the
|>> operator to map the parser results to
toProtein and that map those value to the
ToString method. The result is a Parser capable of checking if a string matches a
ProteinCodon, but returns a string representing a
Protein. It's full return type is
Let's see what it tells us with a valid input. The run function is what we use to actually execute our parsers on some data.
run pProteinString "AUG" Out: "Methionine"
Great! That did exactly what we wanted. It parsed the string representing a codon, but returned the value of the codon converted to a protein, converted to a string. Let's now try an invalid input. Something that shouldn't be parsable into a protein.
run pProteinString "UGA" Out: Error in Ln: 1 Col: 1 UGA ^ Expecting: 'AUG', 'UAC', 'UAU', 'UCA', 'UCC', 'UCG', 'UCU', 'UGC', 'UGG', 'UGU' , 'UUA', 'UUC', 'UUG' or 'UUU'
Error! or well, successful error? Our parser successfully rejected the
StopCodon string we passed in, and told us it was looking for a string representing a
let pStopCodon = pUnion<StopCodon> >>% () let pStopCodonOrEOL = pStopCodon <|> eof
Next, we make the parser for our
pStopCodon. Note that
pStopCodon parses a
StopCodon, but then uses the FParsec operator
>>% to ignore that value and return a
unit instead. This is so we can combine it with the
eol (end of line) parser in
<|> is another
FParsec operator which takes two parsers and returns a parser that can be either one, so long as they're the same type. Since
eol returns a unit () we needed to get stop to do the same.
run pStopCodonOrEOL "UGA" //valid StopCodon input Out: <null> run pStopCodonOrEOL "" // valid "end of line" input Out: <null> run pStopCodonOrEOL "xxx" //invalid input Out: Error in Ln: 1 Col: 1 xxx ^ Expecting: end of input, 'UAA', 'UAG' or 'UGA'
Success! Our parser successfully accepts a string that represents a
StopCodon or the end of a string but rejects anything else. Note that our "unit" type output from the successful parsers is returned as
let pProteinStringList = manyTill pProteinString pStopCodonOrEOL
Finally we put it all together in the
pProteinStringList parser. Here we use the
manyTill is a function that takes two parsers; it will continue to collect values of the first until it hits a value of the second, and then returns the values from the first as a list.
This is conveniently, exactly what we need for this challenge!
let proteins s = match run pProteinStringList s with | Success (result, _, _) -> Some result | Failure _ -> None
Now we can finally put it all together with our proteins functions which accepts a string and uses the
run which is what actually applies our parsers to some string.
run returns a
ParserResult which is a
Success if the input was parsed without error and a
Failure with error messages if it doesn't. Here we're choosing to ignore all the extra meta data and just convert these to an option> type.
proteins "AUG" // returns [ "Methionine" ] Out: [ "Methionine" ] proteins "UGA" // returns  Out: [ ] proteins "AUGUCUUAGAUG" // returns [ "Methionine", "Serine" ] Out: [ "Methionine", "Serine" ]
And that's it. It all works. Thank you for reading.
Without the comments, it's pretty concise! Most of the code is just setting up our domain! And we haven't had to define any loops, lists, maps, or finicky control logic. Our controls are baked into our data types. Yay parsers!
#r "nuget: FParsec" open FParsec type ProteinCodon = AUG | UUU | UUC | UUA | UUG | UCU | UCC | UCA | UCG | UAU | UAC | UGU | UGC | UGG type StopCodon = UAA | UAG | UGA type Protein = Methionine | Phenylalanine | Leucine | Serine | Tyrosine | Cysteine | Tryptophan let toProtein = function | AUG -> Methionine | UUU -> Phenylalanine | UUC -> Phenylalanine | UUA -> Leucine | UUG -> Leucine | UCU -> Serine | UCC -> Serine | UCA -> Serine | UCG -> Serine | UAU -> Tyrosine | UAC -> Tyrosine | UGU -> Cysteine | UGC -> Cysteine | UGG -> Tryptophan let pUnion<'t> : Parser<'t, unit> = let pCase (case:UnionCaseInfo) = stringReturn case.Name (FSharpValue.MakeUnion(case, [||]) :?> 't) FSharpType.GetUnionCases typeof<'t> |> Seq.map pCase |> choice let pProteinString = pUnion<ProteinCodon> |>> toProtein |>> fun p -> p.ToString() let pStopCodon = pUnion<StopCodon> >>% () let pStopCodonOrEOL = pStopCodon <|> eof let pProteinStringList = manyTill pProteinString pStopCodonOrEOL let proteins s = match run pProteinStringList s with | Success (result, _, _) -> Some result | Failure _ -> None