There are useful chompers, not defined in elm/parser, that can help you in the lexical analysis stage of your parser.
What are chompers?
Chompers or the "chomping" family of functions are special functions, that return Parser ()
, that allow you to validate a sequence of characters. elm/parser
defines two such functions called chompIf
and chompWhile
but there are many others, for e.g. chompOptional
, chompOneOrMore
, chompExactly
, chompAtLeast
, chompAtMost
, and chompBetween
, that can be useful to define as well.
What is lexical analysis?
A parser is usually separated into two stages. The first stage does the lexical analysis, or token generation, by which the input characters are split into meaningful symbols defined by a grammar of regular expressions.
For e.g. the meaningful symbols in the context-free grammar for spreadsheet formulas are Text
, Decimal
, Coord
, and Identifier
.
Formula ::= '=' Expr | Text
Expr ::= Number
| Cell
| Range
| Application
Number ::= Decimal
Cell ::= Coord
Range ::= Coord ':' Coord
Application ::= Identifier '(' ( Expr ( ',' Expr )* )? ')'
/* Lexemes */
Text ::= [^=] .*
Decimal ::= '-'? [0-9]+ ('.' [0-9]*)?
Coord ::= [A-Z] ( '0' | [1-9] [0-9]? )
Identifier ::= [a-zA-Z_] [a-zA-Z0-9_]*
Lexical analysis with chompers
You can use chompers to write the equivalent of any regular expression. As a result, chompers become pretty handy for lexical analysis.
For e.g. the lexemes Text
, Decimal
, Coord
, and Identifier
are defined by regular expressions in the grammar above. This means we can rewrite them in elm/parser
using a combination of chompers. Decimal
could be defined as follows:
decimal : Parser Float
decimal =
chompDecimal
|> P.mapChompedString (\s _ -> String.toFloat s |> Maybe.withDefault 0)
|> lexeme
chompDecimal : Parser ()
chompDecimal =
let
decimalPart =
P.succeed ()
|. P.chompIf ((==) '.')
|. P.chompWhile Char.isDigit
in
P.succeed ()
|. chompOptional ((==) '-')
|. chompOneOrMore Char.isDigit
|. chompZeroOrOne decimalPart
Read the Lexer.elm module to see how the rest of them could be defined.
N.B. Assume elm/parser is installed and the following imports have been made available in the code snippets below.
import Parser as P exposing ((|.), (|=), Parser)
chompOptional
It validates zero or one occurrences of the matching characters. It is equivalent to the question mark ?
in regular expressions.
chompOptional : (Char -> Bool) -> Parser ()
chompOptional isGood =
P.oneOf
[ P.chompIf isGood
, P.succeed ()
]
For e.g. we can define optionalSign ::= ( '+' | '-' )?
as follows:
chompOptionalSign : Parser ()
chompOptionalSign =
chompOptional (\c -> c == '+' || c == '-')
Check out a full example.
chompOneOrMore
It validates one or more occurrences of the matching characters. It is equivalent to the plus sign +
in regular expressions.
chompOneOrMore : (Char -> Bool) -> Parser ()
chompOneOrMore isGood =
P.succeed ()
|. P.chompIf isGood
|. P.chompWhile isGood
For e.g. we can define natural ::= [0-9]+
as follows:
chompNatural : Parser ()
chompNatural =
chompOneOrMore Char.isDigit
Check out a full example.
chompExactly
chompExactly n
is equivalent to {n}
in regular expressions. It validates that the matching characters occurred exactly n
times.
Here's a straightforward but slow implementation:
chompExactlySlow : Int -> (Char -> Bool) -> Parser ()
chompExactlySlow n isGood =
if n <= 0 then
P.succeed ()
else
-- if n > 0 then
P.succeed ()
|. P.chompIf isGood
|. chompExactlySlow (n - 1) isGood
And, here's a faster implementation that uses loop
:
chompExactlyFast : Int -> (Char -> Bool) -> Parser ()
chompExactlyFast n isGood =
P.loop n
(\i ->
if i <= 0 then
P.succeed <| P.Done ()
else
P.chompIf isGood
|> P.andThen (\_ -> P.succeed (P.Loop (i - 1)))
)
I ran some benchmarks and it turns out that chompExactlyFast
is about 70% faster than chompExactlySlow
.
As an example, you can use chompExactly
to help you parse ZIP+4 formatted zip codes, where zipCode ::= [0-9]{5} '-' [0-9]{4}
.
type alias ZipCode =
{ basic : String
, geoSegment : String
}
zipCode : Parser ZipCode
zipCode =
P.succeed ZipCode
|= nDigits 5
|. P.chompIf ((==) '-')
|= nDigits 4
nDigits : Int -> Parser String
nDigits =
P.getChompedString << chompNDigits
chompNDigits : Int -> Parser ()
chompNDigits n =
chompExactly n Char.isDigit
Check out a full example.
chompAtLeast
chompAtLeast min
is equivalent to {min,}
in regular expressions. It validates that the matching characters occurred min
or more times.
chompAtLeast : Int -> (Char -> Bool) -> Parser ()
chompAtLeast min isGood =
P.succeed ()
|. chompExactly min isGood
|. P.chompWhile isGood
Check out some examples of chompAtLeast
.
chompAtMost
chompAtMost max
is equivalent to {, max}
in regular expressions. It validates that the matching characters occurred up to max
times.
chompAtMost : Int -> (Char -> Bool) -> Parser ()
chompAtMost max isGood =
P.loop 0
(\i ->
if i < max then
P.oneOf
[ P.chompIf isGood
|> P.andThen (\_ -> P.succeed (P.Loop (i + 1)))
, P.succeed <| P.Done ()
]
else
P.succeed <| P.Done ()
)
Firstly, notice that chompAtMost max
always succeeds just like chompWhile
. The only difference is that it is bounded by max
. While i < max
we try to consume the next character, P.chompIf isGood
. But, if it fails that's fine since it means we've matched all the characters we could match up to this point, i.e. i
of them. That's why we stop with success. In the case, i == max
, we've match the most we're allowed to match so again we stop with success.
Check out some examples of chompAtMost
.
chompBetween
chompBetween min max
is equivalent to {min, max}
in regular expressions. It validates that the matching characters occurred at least min
times, but not more than max
times.
chompBetween : Int -> Int -> (Char -> Bool) -> Parser ()
chompBetween min max isGood =
if min > max then
P.succeed ()
else
let
l =
Basics.max min 0
h =
Basics.max l max
in
chompExactly l isGood
|> P.andThen (\_ -> chompAtMost (h - l) isGood)
Given, 0 < min <= max
, then to implement chompBetween min max
we first try to match exactly min
characters. Then, the most characters we're allowed to match after that is max - min
.
Check out some examples of chompBetween
.
Conclusion
Chompers work at the character level just like regular expressions. They can be used to describe the lexemes or tokens of your grammar. Because of this, I recommend putting your chomping functions in a Lexer.elm
module and having a Parser.elm
module that depends on it.
You can view an example of my approach in my formula lexer and formula parser (here's the grammar) for the 7GUIs Cells task.
For now I'm just collecting useful chompers in dwayne/elm-chompers and some day I may turn it into an Elm package, but for now I'm still exploring. Feel free to copy and paste these chompers into your own projects if you find them useful.
Top comments (3)
Thank you for the excellent article.
One question has been bothering me for a long time.
Just to give you contrived example, given string
<h1>Title: Something</h1>
how do I captureTitle: Something
between the tags equivalent to regex>(.*?)<
.For this particular case you can do something like the following:
Here's the full example.
The picture 🤩
To be clear, I also liked the article, but I felt that the picture deserved a special mention ^_^