DEV Community

Eric Burden
Eric Burden

Posted on • Originally published at ericburden.work on

Parsing Puzzle Text with Nom

The season is nearly upon us! And by the season, I of course am referring to the Advent of Code (what else?). This year I plan to take my first pass at all the puzzles in Rust, and I’ve already set up my repository. This past year, I went back and finished all the prior years of AoC in Julia (repo here), which was really a delightful experience. I considered on reflection, though, just how much I had leaned on regular expressions for input parsing over the course of those many, many puzzles. The regular expression support in Julia was just too good! However, given that I like to use AoC as a learning and practice exercise, my thoughts drifted to other options for input parsing. I’ve had some small experience with the nom parser combinator crate, and while the menu of sub-modules and functions spread throughout seems a bit daunting at first, using nom for text parsing seems like a handy skill to have. So, I decided to practice a bit on some past AoC puzzles. The input from 2020, Day 4 was interesting enough that I thought it deserved a blog post, so here it is!

The Goal

The puzzle itself probably has a better description of the input, but to save you from needing to navigate over, here’s a brief overview… You’re given a text file with a bunch of “passport” entries, each of which is separated by a blank line, like so:

ecl:gry pid:860033327 eyr:2020 hcl:#fffffd
byr:1937 iyr:2017 cid:147 hgt:183cm

iyr:2013 ecl:amb cid:350 eyr:2023 pid:028048884
hcl:#cfa07d byr:1929

hcl:#ae17e1 iyr:2013
eyr:2024
ecl:brn pid:760753108 byr:1931
hgt:179cm

hcl:#cfa07d eyr:2025 pid:166559648
iyr:2011 ecl:brn hgt:59in

Enter fullscreen mode Exit fullscreen mode

Note that each entry may constitute one or more lines and on each line the individual fields are separated by a single space. Each field is a string that contains a label and a value, separated by a colon like <label>:<value>. My goal is to parse each entry into the most specific possible representation. You may think this is overkill, and you may be right, but remember the goal here is to overuse nom, so there. Here’s the format I went with:

struct Passport<'a> {
    byr: Option<Year<'a>>,
    iyr: Option<Year<'a>>,
    eyr: Option<Year<'a>>,
    hgt: Option<Height<'a>>,
    hcl: Option<HairColor<'a>>,
    ecl: Option<EyeColor<'a>>,
    pid: Option<IdNumber<'a>>,
    cid: Option<IdNumber<'a>>,
}

enum Year<'a> {
    Valid(u16),
    Invalid(&'a str),
}

enum Height<'a> {
    CM(u16),
    IN(u16),
    Invalid(&'a str),
}

enum HairColor<'a> {
    Valid(u8, u8, u8),
    Invalid(&'a str),
}

enum EyeColor<'a> {
    Amber,
    Blue,
    Brown,
    Gray,
    Green,
    Hazel,
    Other,
    Invalid(&'a str),
}

enum IdNumber<'a> {
    Valid(u32),
    Invalid(&'a str),
}

Enter fullscreen mode Exit fullscreen mode

Why, oh, why!? Well, according to the puzzle instructions, each entry may or may not contain each field, and each field may or may not actually be possible to represent correctly, like "ecl:1984" or some junk. So, what am I doing here? Why not just store everything in a HashMap<&str,&str> or something? Specificity. In particular, it’s a good excuse to do lots of parsing. Each field in Passport is optional because the input may or may not contain an entry for that field, and each of the enums has an Invalid variant where I will store the string that couldn’t be parsed into a Valid value. This means that "ecl:1984" would be represented as a Some(EyeColor::Invalid("1984")). In a more “real-world” scenario, this might be really useful, allowing the developer to easily pick out fields that couldn’t be parsed and review their contents. For me, this was super handy for checking examples to determine when my parsers weren’t doing what I expected them to.

Parser Combinators

If you’re not familiar with the term, a “combinator” is a function (typically a small, focused function) that takes in a function and returns a function. If that definition doesn’t make it clear (because of course it doesn’t), you can think of combinators as pluggable pieces of an assembly line, where can piece together a series of combinators in such a way as to build “pipelines” that take input at one end and produce different kinds of output out the other end, depending on the order and arrangement of the pieces. If it’s still a bit fuzzy, then perhaps the examples below will help.

The General Shape

Before we start writing any parsing code, let’s try to define something like a “grammar” for these passports, that is, what structure can we infer from the entries as they’re given. For example, we can see that each entry consists of a list of “field"s separated by either a single space or a newline character. In the image below, each Field is highlighted blue and the Separator s (either a space or newline) are indicated by yellow rectangles.

A passport entry with fields and separators highlighted

Given that each Field represents a Label and a Value , our current structure looks like:

Passport = [(Field, Separator), (Field, Separator), ...]
Separator = " " || "\n"
Field = "<Label>:<Value>"
Label = "byr" || "iyr" || "eyr" || "hgt" || "hcl" || "ecl" || "pid" || "cid"
Value = ????

Enter fullscreen mode Exit fullscreen mode

Let’s keep in mind that there are no promises that only valid label values will be given in the input, so we’ll need to account for that. We know that the form of Value will be given by the content of Label, but we don’t have a good feel for what the options for Value look like yet.

Field Types

Years

There are three different fields that should contain a “year” value, represented by a decimal number. These fields have the label “byr”, “iyr”, or “eyr”.

Passport = [(Field, Separator), (Field, Separator), ...]
Separator = " " || "\n"
Field = "<Label>:<Value>"
Label 
  = "byr" -> Value = Year
  = "iyr" -> Value = Year
  = "eyr" -> Value = Year
  = "hgt" -> Value = ???
  = "hcl" -> Value = ???
  = "ecl" -> Value = ???
  = "pid" -> Value = ???
  = "cid" -> Value = ???
Year = [0-9]+

Enter fullscreen mode Exit fullscreen mode

Yes, I am representing the definition of Year as a regular expression. Sue me.

Height

Height can either be given in inches or centimeters, which is given as a number followed by the suffix “in” or “cm”. This field has the label “hgt”.

Passport = [(Field, Separator), (Field, Separator), ...]
Separator = " " || "\n"
Field = "<Label>:<Value>"
Label 
  = "byr" -> Value = Year
  = "iyr" -> Value = Year
  = "eyr" -> Value = Year
  = "hgt" -> Value = Height
  = "hcl" -> Value = ???
  = "ecl" -> Value = ???
  = "pid" -> Value = ???
  = "cid" -> Value = ???
Year = [0123456789]+
Height = [0123456789]+("in"|"cm")

Enter fullscreen mode Exit fullscreen mode

Hair Color

Hair color should be given as a hex color code, complete with '#'. This field has the label “hcl”.

Passport = [(Field, Separator), (Field, Separator), ...]
Separator = " " || "\n"
Field = "<Label>:<Value>"
Label 
  = "byr" -> Value = Year
  = "iyr" -> Value = Year
  = "eyr" -> Value = Year
  = "hgt" -> Value = Height
  = "hcl" -> Value = HairColor
  = "ecl" -> Value = ???
  = "pid" -> Value = ???
  = "cid" -> Value = ???
Year = [0123456789]+
Height = [0123456789]+("in"|"cm")
HairColor = ("#")[0123456789abcdef]+

Enter fullscreen mode Exit fullscreen mode

Eye Color

Eye color can be one of “amb”, “blu”, “brn”, “gry”, “grn”, “hzl”, or “oth”, and that’s it. This field has the label “ecl”.

Passport = [(Field, Separator), (Field, Separator), ...]
Separator = " " || "\n"
Field = "<Label>:<Value>"
Label 
  = "byr" -> Value = Year
  = "iyr" -> Value = Year
  = "eyr" -> Value = Year
  = "hgt" -> Value = Height
  = "hcl" -> Value = HairColor
  = "ecl" -> Value = EyeColor
  = "pid" -> Value = ???
  = "cid" -> Value = ???
Year = [0123456789]+
Height = [0123456789]+("in"|"cm")
HairColor = ("#")[0123456789abcdef]+
EyeColor = "amb" || "blu" || "brn" || "gry" || "grn" || "hzl" || "oth"

Enter fullscreen mode Exit fullscreen mode

ID Numbers

Both Passport ID and Country ID can be represented as numbers. These are a bit larger than a Year , so we’ll make a distinction between the two. These fields have the label “pid” or “cid”.

Passport = [(Field, Separator), (Field, Separator), ...]
Separator = " " || "\n"
Field = "<Label>:<Value>"
Label 
  = "byr" -> Value = Year
  = "iyr" -> Value = Year
  = "eyr" -> Value = Year
  = "hgt" -> Value = Height
  = "hcl" -> Value = HairColor
  = "ecl" -> Value = EyeColor
  = "pid" -> Value = IdNumber
  = "cid" -> Value = IdNumber
Year = [0123456789]+
Height = [0123456789]+("in"|"cm")
HairColor = ("#")[0123456789abcdef]+
EyeColor = "amb" || "blu" || "brn" || "gry" || "grn" || "hzl" || "oth"
IdNumber = [0123456789]+

Enter fullscreen mode Exit fullscreen mode

With the completed specification, we can start writing functions! We’ll start at the bottom of our specification and build our way up.

Combinator Functions

IdNumber

use nom::{
    bytes::complete::take,
    character::complete::u32,
    combinator::all_consuming,
    IResult,
};

enum IdNumber<'a> {
    Valid(u32),
    Invalid(&'a str),
}

impl<'a> From<&'a str> for IdNumber<'a> {
    fn from(value: &'a str) -> Self {
        match id_number(value) {
            Ok((_, year)) => Self::Valid(year),
            Err(_) => Self::Invalid(value),
        }
    }
}

fn id_number(s: &str) -> IResult<&str, u32> {
    let (_, nine_chars) = all_consuming(take(9usize))(s)?;
    u32(nine_chars)
}
Enter fullscreen mode Exit fullscreen mode

There are a few things to note here. One, the nom functions can take a variety of inputs, but in this case we'll pass a string slice to all of them. The IResult<T, U> from nom is the equivalent of Result<(T, U), E> where E is a nom::error::Error, U is the result of successfully parsing the input, and T is the rest of the unparsed input.

It is generally a good idea to wrap the nom functions in your own functions for the function signature, otherwise you can end up needing some really gnarly type hints if you try to use the parser combinator functions in-line. So, we can understand id_number(&str) -> IResult<&str, u32> to be a function that takes a string slice and parses it into a (&str, u32), where the returned string slice is the rest of the input and the u32 is the resulting number.

I'd recommend taking a look at the list of nom combinators on GitHub. This list makes it much easier to find the combinator you're looking for. Our id_number(&str) -> IResult<&str, u32> function first takes 9 characters from the input and returns an error if there are more than 9 characters there. It then attempts to parse those 9 characters into a u32 and returns the result. We use the nom::character::complete::u32 function instead of _.parse::<u32>() to keep the nom error types.

Beyond that, we've implemented From such that IdNumber::from(&str) tries to parse the given string into a u32 containing exactly 9 digits including leading zeros and, if successful, returns an IdNumber::Valid(u32). Otherwise, you get back an IdNumber::Invalid(&str) containing the original string slice.

There are a few different ways you could structure this, including by implementing FromStr or making id_number() a member function of IdNumber, but the heart of it should remain the same.

EyeColor

enum EyeColor<'a> {
    Amber,
    Blue,
    Brown,
    Gray,
    Green,
    Hazel,
    Other,
    Invalid(&'a str),
}

impl<'a> From<&'a str> for EyeColor<'a> {
    fn from(s: &'a str) -> Self {
        match s {
            "amb" => EyeColor::Amber,
            "blu" => EyeColor::Blue,
            "brn" => EyeColor::Brown,
            "gry" => EyeColor::Gray,
            "grn" => EyeColor::Green,
            "hzl" => EyeColor::Hazel,
            "oth" => EyeColor::Other,
            _ => EyeColor::Invalid(s),
        }
    }
}

Enter fullscreen mode Exit fullscreen mode

I honestly couldn’t come up with a good reason to use any parser functions here, we’re just matching the input string to one of the EyeColor variants or returningEyeColor::Invalid(&str) if we can’t.

HairColor

use nom::{
  bytes::complete::take_while_m_n,
  character::complete::char,
  combinator::map_res,
  sequence::{preceded, tuple},
  IResult,
}

enum HairColor<'a> {
    Valid(u8, u8, u8),
    Invalid(&'a str),
}

impl<'a> From<&'a str> for HairColor<'a> {
    fn from(value: &'a str) -> Self {
        match hex_color_triplet(value) {
            Ok((_, (r, g, b))) => HairColor::Valid(r, g, b),
            Err(_) => HairColor::Invalid(value),
        }
    }
}

/// Use `nom` parser combinators to parse "#ffffff" into (255, 255, 255).
fn hex_color_triplet(s: &str) -> IResult<&str, (u8, u8, u8)> {
    preceded(char('#'), tuple((hex_value, hex_value, hex_value)))(s)
}

/// Use `nom` to parse two hex characters into their u8 value.
fn hex_value(s: &str) -> IResult<&str, u8> {
    let two_hex_bytes = take_while_m_n(2, 2, |c: char| c.is_ascii_hexdigit());
    let convert_to_u8 = |v| u8::from_str_radix(v, 16);
    map_res(two_hex_bytes, convert_to_u8)(s)
}

Enter fullscreen mode Exit fullscreen mode

Now here’s some parsing! We can see that the From<&str> trait implementation is almost exactly the same as for IdNumber, so that’s not too interesting, buthex_value(&str) and hex_color_triplet(&str) really show off what the parser combinator functions can do. hex_color_triplet(&str) looks for a string that starts with ‘#’ and is followed by three substrings that can be parsed byhex_value(&str) which are return in a three-tuple. hex_value(&str) takes the first two characters as long as they satisfy c.is_ascii_hexdigit() and attempts to convert them into a u8 using u8::from_str_radix(&str, 16). Recall that we get a nom::error::Error if any part of this process fails, which ultimately means a HairColor::Invalid(&str).

Height

use nom::{
    branch::alt,
    bytes::complete::tag,
    character::complete::u16,
    sequence::pair,
    IResult,
};

enum Height<'a> {
    CM(u16),
    IN(u16),
    Invalid(&'a str),
}

impl<'a> From<&'a str> for Height<'a> {
    fn from(value: &'a str) -> Self {
        if let Ok((_, (number, unit))) = number_unit_pair(value) {
            match unit {
                "in" => Height::IN(number),
                "cm" => Height::CM(number),
                _ => unreachable!(),
            }
        } else {
            Height::Invalid(value)
        }
    }
}

/// `nom` parser combinator to parse "173cm" into (173, "cm").
fn number_unit_pair(s: &str) -> IResult<&str, (u16, &str)> {
    pair(u16, alt((tag("in"), tag("cm"))))(s)
}

Enter fullscreen mode Exit fullscreen mode

Height is interesting because we could be looking at a Height::CM(u16) or aHeight::IN(u16) depending on the last two letters of the input. number_unit_pair(&str)is looking for a number that can be parsed to a u16 followed by either “in” or “cm” and returns as the output a tuple of the number and the suffix as a string. With that in hand, it’s straightforward to wrap the number in the appropriate variant of Height.

Year

use nom::{
    bytes::complete::take,
    character::complete::u16,
    combinator::all_consuming,
    IResult,
};

enum Year<'a> {
    Valid(u16),
    Invalid(&'a str),
}

impl<'a> From<&'a str> for Year<'a> {
    fn from(value: &'a str) -> Self {
        match year(value) {
            Ok((_, year)) => Self::Valid(year),
            Err(_) => Self::Invalid(value),
        }
    }
}

fn year(s: &str) -> IResult<&str, u16> {
    let (_, four_chars) = all_consuming(take(4usize))(s)?;
    u16(four_chars)
}
Enter fullscreen mode Exit fullscreen mode

This is almost exactly the same as IdNumber, just with a u16 instead of a u32 and four digits instead of nine. Specificity!

Passport

With all the pieces in place, we’re ready to parse an entire Passport!

use nom::{
    branch::alt,
    bytes::complete::{is_not, tag, take},
    multi::separated_list1,
    sequence::separated_pair,
    IResult,
};

/// Parses a line-separated string into a list of the results of attempting
/// to parse each entry into a `Passport`.
fn parse_file(s: &str) -> Vec<Result<Passport, ParsingError>> {
    s.split("\n\n").map(Passport::try_from).collect()
}

enum ParsingError<'a> {
    InvalidPassport(&'a str),
    InvalidField(&'a str),
}

#[derive(Debug, Default)]
struct Passport<'a> {
    byr: Option<Year<'a>>,
    iyr: Option<Year<'a>>,
    eyr: Option<Year<'a>>,
    hgt: Option<Height<'a>>,
    hcl: Option<HairColor<'a>>,
    ecl: Option<EyeColor<'a>>,
    pid: Option<IdNumber<'a>>,
    cid: Option<IdNumber<'a>>,
}

/// Create a `Passport` based on a chunk of the input string.
impl<'a> TryFrom<&'a str> for Passport<'a> {
    type Error = ParsingError<'a>;

    fn try_from(s: &'a str) -> Result<Self, Self::Error> {
        // If we don't get back a list of (<field name>, <field value>) from the parsing
        // function, then it's an invalid passport entry.
        let (_, fields) = field_list(s).map_err(|_| ParsingError::InvalidPassport(s))?;

        // Initialize a new passport with all the optional fields set to `None`.
        let mut passport = Passport::default();

        // Check each field label, parse its value, and put the value in the right
        // `Passport` field. If the label is unrecognizable, then it's an invalid
        // field and gets reported. In hindsight, instead of returning an error 
        // here, you could just store the un-parseable field entries in the `Passport`
        // struct for later review.
        for (label, value) in fields {
            match label {
                "byr" => passport.byr = Some(Year::from(value)),
                "iyr" => passport.iyr = Some(Year::from(value)),
                "eyr" => passport.eyr = Some(Year::from(value)),
                "hgt" => passport.hgt = Some(Height::from(value)),
                "hcl" => passport.hcl = Some(HairColor::from(value)),
                "ecl" => passport.ecl = Some(EyeColor::from(value)),
                "pid" => passport.pid = Some(IdNumber::from(value)),
                "cid" => passport.cid = Some(IdNumber::from(value)),
                _ => return Err(ParsingError::InvalidField(s)),
            };
        }
        Ok(passport)
    }
}

/// Parses "byr:1985 ecl:blu hgt:173cm" into [("byr", "1985"), ("ecl", "blu"),
/// ("hgt", "173cm")]
fn field_list(s: &str) -> IResult<&str, Vec<(&str, &str)>> {
    separated_list1(separator, field_value_pair)(s)
}

/// All field/value pairs in a passport entry are separated by either a space or
/// a newline. If that were to change, this function would be the one to be updated.
fn separator(s: &str) -> IResult<&str, &str> {
    alt((tag(" "), tag("\n")))(s)
}

/// Parses "byr:1985" into ("byr", "1985").
fn field_value_pair(s: &str) -> IResult<&str, (&str, &str)> {
    separated_pair(take(3usize), tag(":"), is_not(" \n"))(s)
}

Enter fullscreen mode Exit fullscreen mode

I’ve opted to have a couple of different possible error types just to cover my bases, cases where there are field labels that aren’t in Label get reported as aParsingError::InvalidField(&str) and cases where the parser can’t identify the basic Passport structure (fields and separators) get reported as aParsingError::InvalidPassport(&str). Note the heavy use of our previously implemented From<&str> traits within the From<&str> implementation forPassport.

Close-Out

This was a good exercise for me. Honestly, nom was a bit intimidating to get started with and the documentation could use a bit more work in terms of examples. That’s one of the reasons I decided to write this post, to get more nom examples out into the world. As I started to get more familiar with usage patterns and where different things were in the library, I was better able to chain and nest combinators to do what I wanted. The code above is by no means representative of my first approach, initially I found I was creating overly complicated combinators because I just didn’t know about other options within the library. I will say, though, that taking this approach was much more “de-buggable” than a complex enough regular expression to handle this situation. I suspect that, in the long run, this is a more maintainable approach as well. All in all, worth the time investment to get familiar, and I’m looking forward to practicing more during this year’s Advent of Code.

Top comments (0)