This year was my first participating in Advent of Code—and I’m glad I did, because solving one of the challenges exposed me to an excellent data validation library for Python named Cerberus.
What’s in a valid passport
Below are some excerpts from the challenge, along with the specific field-level validation rules:
You arrive at the airport only to realize that you grabbed your North Pole Credentials instead of your passport. While these documents are extremely similar, North Pole Credentials aren’t issued by a country and therefore aren’t actually valid documentation for travel in most of the world.
It seems like you’re not the only one having problems, though; a very long line has formed for the automatic passport scanners, and the delay could upset your travel itinerary.
…
The line is moving more quickly now, but you overhear airport security talking about how passports with invalid data are getting through. Better add some data validation, quick!
You can continue to ignore the cid field, but each other field has strict rules about what values are valid for automatic validation:
- byr (Birth Year) - four digits; at least 1920 and at most 2002.
- iyr (Issue Year) - four digits; at least 2010 and at most 2020.
- eyr (Expiration Year) - four digits; at least 2020 and at most 2030.
- hgt (Height) - a number followed by either cm or in:
  - If cm, the number must be at least 150 and at most 193.
  - If in, the number must be at least 59 and at most 76.
- hcl (Hair Color) - a # followed by exactly six characters 0-9 or a-f.
- ecl (Eye Color) - exactly one of: amb blu brn gry grn hzl oth.
- pid (Passport ID) - a nine-digit number, including leading zeroes.
- cid (Country ID) - ignored, missing or not.
Your job is to count the passports where all required fields are both present and valid according to the above rules.
For completeness, here are some invalid passports (delimited by \n\n):
eyr:1972 cid:100
hcl:#18171d ecl:amb hgt:170 pid:186cm iyr:2018 byr:1926

iyr:2019
hcl:#602927 eyr:1967 hgt:170cm
ecl:grn pid:012533040 byr:1946

hcl:dab227 iyr:2012
ecl:brn hgt:182cm pid:021572410 eyr:2020 byr:1992 cid:277
And, some valid passports:
pid:087499704 hgt:74in ecl:grn iyr:2012 eyr:2030 byr:1980
hcl:#623a2f

eyr:2029 ecl:blu cid:129 byr:1989
iyr:2014 pid:896056539 hcl:#a97842 hgt:165cm

hcl:#888785
hgt:164cm byr:2001 iyr:2015 cid:88
pid:545766238 ecl:hzl
eyr:2022
Most of the validation rules look straightforward in isolation, but less so when you think about composing them all together.
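To make the composition problem concrete, here is a rough sketch of what a hand-rolled version of these rules might look like using only the standard library. The names (year_between, valid_height, RULES, is_valid) are my own for illustration and are not part of the challenge or of Cerberus.

```python
import re

EYE_COLORS = {"amb", "blu", "brn", "gry", "grn", "hzl", "oth"}


def year_between(value: str, low: int, high: int) -> bool:
    """Exactly four digits, and within the inclusive range [low, high]."""
    return re.fullmatch(r"[0-9]{4}", value) is not None and low <= int(value) <= high


def valid_height(value: str) -> bool:
    """A number followed by cm (150 to 193) or in (59 to 76)."""
    match = re.fullmatch(r"([0-9]+)(cm|in)", value)
    if match is None:
        return False
    number, unit = int(match.group(1)), match.group(2)
    low, high = (150, 193) if unit == "cm" else (59, 76)
    return low <= number <= high


# One predicate per required field; cid is deliberately absent.
RULES = {
    "byr": lambda v: year_between(v, 1920, 2002),
    "iyr": lambda v: year_between(v, 2010, 2020),
    "eyr": lambda v: year_between(v, 2020, 2030),
    "hgt": valid_height,
    "hcl": lambda v: re.fullmatch(r"#[0-9a-f]{6}", v) is not None,
    "ecl": lambda v: v in EYE_COLORS,
    "pid": lambda v: re.fullmatch(r"[0-9]{9}", v) is not None,
}


def is_valid(passport: dict) -> bool:
    """True when every required field is present and passes its rule."""
    return all(
        field in passport and rule(passport[field]) for field, rule in RULES.items()
    )


# The first valid passport from the examples above, and a copy with a bad hcl.
good = {
    "pid": "087499704", "hgt": "74in", "ecl": "grn", "iyr": "2012",
    "eyr": "2030", "byr": "1980", "hcl": "#623a2f",
}
bad = dict(good, hcl="dab227")  # missing the leading '#'
print(is_valid(good), is_valid(bad))  # True False
```

It works, but the rules, the "is this field required" logic, and the error handling all end up tangled together in one place, which is the part a schema-based library takes off your hands.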
Validating passports with Cerberus
Step one involved getting familiar with Cerberus validation rules. The library supports rules like the following:
- contains - This rule validates that a container object contains all of the defined items.
>>> document = {"states": ["peace", "love", "inity"]}
>>> schema = {"states": {"contains": "peace"}}
>>> v.validate(document, schema)
True
- regex - The validation will fail if the field’s value does not match the provided regular expression.
>>> schema = {
...     "email": {
...         "type": "string",
...         "regex": r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$"
...     }
... }
>>> document = {"email": "john@example.com"}
>>> v.validate(document, schema)
True
- required - If True the field is mandatory. Validation will fail when it is missing.
>>> v.schema = {"name": {"required": True, "type": "string"}, "age": {"type": "integer"}}
>>> document = {"age": 10}
>>> v.validate(document)
False
Step two involved converting the passports into Cerberus documents. This was mostly an exercise in parsing uniquely assembled text into Python dictionaries.
import re

# Split the batch file records by double newline.
for record in batch_file.read().split("\n\n"):
    # Split the fields within a record on any run of spaces or newlines.
    record_field_list = [
        tuple(field.split(":")) for field in re.split(r"\s+", record.strip())
    ]
That leaves record_field_list looking like:
>>> record_field_list
[('ecl', 'gry'),
 ('pid', '860033327'),
 ('eyr', '2020'),
 ('hcl', '#fffffd'),
 ('byr', '1937'),
 ('iyr', '2017'),
 ('cid', '147'),
 ('hgt', '183cm')]
From there, dict converts the list of tuples into a proper Cerberus document:
>>> document = dict(record_field_list)
>>> document
{'byr': '1937',
 'cid': '147',
 'ecl': 'gry',
 'eyr': '2020',
 'hcl': '#fffffd',
 'hgt': '183cm',
 'iyr': '2017',
 'pid': '860033327'}
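The parsing step can also be wrapped up as a small function that goes straight from batch text to a list of documents. This is a sketch rather than the code I originally ran: parse_batch is a hypothetical helper name, and it leans on str.split() (no arguments) to handle spaces and newlines alike.

```python
from typing import Dict, List


def parse_batch(text: str) -> List[Dict[str, str]]:
    """Turn blank-line-delimited passport records into dictionaries."""
    documents = []
    for record in text.strip().split("\n\n"):
        # str.split() with no arguments splits on any run of whitespace,
        # so fields separated by spaces or single newlines look the same.
        documents.append(dict(field.split(":", 1) for field in record.split()))
    return documents


batch = """\
pid:087499704 hgt:74in ecl:grn iyr:2012 eyr:2030 byr:1980
hcl:#623a2f

eyr:2029 ecl:blu cid:129 byr:1989
iyr:2014 pid:896056539 hcl:#a97842 hgt:165cm
"""

documents = parse_batch(batch)
print(len(documents))       # 2
print(documents[0]["hcl"])  # #623a2f
```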
Putting it all together
Equipped with a better understanding of what’s possible with Cerberus, and a list of Python dictionaries representing passports, I put together the schema below to enforce the challenge’s passport validation rules. Only one of the rules (hgt) required a custom function (compare_hgt_with_units).
from typing import Callable

from cerberus import Validator


# Provide a custom field validation function for a height with units.
# It is defined before SCHEMA because the schema refers to it by name.
def compare_hgt_with_units(field: str, value: str, error: Callable[..., str]) -> None:
    if value.endswith("cm"):
        # Drop the two-character unit suffix before converting to an integer.
        if not (150 <= int(value[:-2]) <= 193):
            error(field, "out of range")
    elif value.endswith("in"):
        if not (59 <= int(value[:-2]) <= 76):
            error(field, "out of range")
    else:
        error(field, "missing units")


SCHEMA = {
    # Field values are strings, so min/max compare them as strings here.
    "byr": {"min": "1920", "max": "2002"},
    "iyr": {"min": "2010", "max": "2020"},
    "eyr": {"min": "2020", "max": "2030"},
    "hgt": {
        "anyof": [
            {"allof": [{"regex": "[0-9]+cm"}, {"check_with": compare_hgt_with_units}]},
            {"allof": [{"regex": "[0-9]+in"}, {"check_with": compare_hgt_with_units}]},
        ]
    },
    "hcl": {"regex": "#[0-9a-f]{6}"},
    "ecl": {"allowed": ["amb", "blu", "brn", "gry", "grn", "hzl", "oth"]},
    "pid": {"regex": "[0-9]{9}"},
    "cid": {"required": False},
}
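If you want to poke at the height check without going through Cerberus, one option is to call it directly with a stand-in error callback that just records what it was given. The sketch below restates a version of the height function so it runs on its own; collect_errors is a hypothetical helper I made up for this purpose.

```python
from typing import Callable, List, Tuple


def compare_hgt_with_units(field: str, value: str, error: Callable[..., None]) -> None:
    if value.endswith("cm"):
        if not (150 <= int(value[:-2]) <= 193):
            error(field, "out of range")
    elif value.endswith("in"):
        if not (59 <= int(value[:-2]) <= 76):
            error(field, "out of range")
    else:
        error(field, "missing units")


def collect_errors(value: str) -> List[Tuple[str, str]]:
    """Run the height check and return every (field, message) it reported."""
    errors: List[Tuple[str, str]] = []
    compare_hgt_with_units(
        "hgt", value, lambda field, message: errors.append((field, message))
    )
    return errors


print(collect_errors("183cm"))  # []
print(collect_errors("58in"))   # [('hgt', 'out of range')]
print(collect_errors("170"))    # [('hgt', 'missing units')]
```

An empty list means the value passed; anything else is what the function would have reported back through the callback.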
With a schema in place, all that’s left to do is instantiate a Validator and validate each document:
>>> v = Validator(SCHEMA, require_all=True)
>>> v.validate(document)
True
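The challenge ultimately asks for a count of valid passports, so the last step is a tally over all of the documents. Here is a minimal sketch of that, assuming you have a list of parsed documents; count_valid and has_required_fields are hypothetical names, and the stand-in validator only checks that fields are present so the block runs without Cerberus installed. With the validator above, passing v.validate as the second argument should work the same way.

```python
from typing import Callable, Dict, Iterable

Passport = Dict[str, str]


def count_valid(
    documents: Iterable[Passport], validate: Callable[[Passport], bool]
) -> int:
    """Count the documents for which validate(document) is truthy."""
    return sum(1 for document in documents if validate(document))


# Stand-in validator for illustration only: checks presence, not values.
REQUIRED = {"byr", "iyr", "eyr", "hgt", "hcl", "ecl", "pid"}


def has_required_fields(document: Passport) -> bool:
    return REQUIRED.issubset(document)


sample = [
    dict.fromkeys(REQUIRED, "x"),  # every required field present
    {"byr": "1980", "cid": "88"},  # most required fields missing
]
print(count_valid(sample, has_required_fields))  # 1
```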
Thanks, Cerberus!