DEV Community

loading...
Cover image for A python regex to validate roman numerals

A python regex to validate roman numerals

alexdjulin profile image Alexandre Donciu-Julin ・4 min read

Note: This is my first post, I hope you'll like it :)

I'm not gonna lie to you... I LOVE REGEX!

As a kid, I grew up playing adventure games full of puzzles and riddles. Looking for the solution was a personal quest, a treasure hunt. Finding it was so exciting, but not as much as jumping on another riddle!

When I discovered regular expressions on my journey to become a (good) python programmer, I felt that same excitement. I was blown away by the countless possibilities these were offering. Deciphering one was like suddenly being able to read hieroglyph, writing one was like discovering I could speak a foreign language. Although I know they should be used with caution and in special cases only, I keep pushing myself to use them everywhere I can.

That's why, when Codewars challenged me to write a function to convert roman numerals from/to Arabic numbers, I could not resist writing a regex to help me solve that problem.

Alt Text

Enough chit-chat, let's get our hands dirty.

My first task was to validate if the user input was a valid roman numeral. To sum up, roman numerals consists of the following symbols:imageSource: Wikipedia

It seems that the thousands unit [M] does not extend past [MMM], which means that the biggest roman number would be [MMMCMXCIX], or 3999. I'm not sure if numbers could go higher than that and why the limit, anyway for the sake of this problem I limited myself with numbers between 1 and 3999.

Now the trick is that the symbols placement is very important. If you don't put them in the right order, the resulting number would be invalid and unreadable. As listed in the table up there, numerals should start with thousands [M] 1000, followed by hundreds [D/C] 500/100, then dozens [L/X] 50/10, and finally units [V/1] 5/1.

BUT, that's not it! Numerals can only repeat 3 times like [CCC] 300 before switching to a combo of two numerals like [CD] 400. So you can still have a [C] 100 before an [M] 1000, like in [CM] 900 for instance.

Bit confusing, isn't it?
image

Alright, let's recap our conditions:

  • Roman numbers are ranging from [I] 1 to [MMMCMXCIX] 3999
  • Numerals should follow a precise order: [M] 1000 / [D] 500 / [C] 100 / [L] 50 / [X] 10 / [V] 5 / [I] 1
  • A numeral cannot repeat more than 3 times, it then uses a pair
  • The following pairs are allowed: [CM] 900 / [CD] 400 / [XC] 90 / [XL] 40 / [IX] 9 / [IV] 4

Do you start to see our REGEX showing up? :)

Let's translate this into code.

For this we are going to use a tag I find really helpful when writing regex is the verbose one (re.VERBOSE or re.X) It allows you to spread your pattern on multiple lines and be more readable. Let's try it!

import re

def is_roman_number(num):

    pattern = re.compile(r"""   
                                ^M{0,3}
                                (CM|CD|D?C{0,3})?
                                (XC|XL|L?X{0,3})?
                                (IX|IV|V?I{0,3})?$
            """, re.VERBOSE)

    if re.match(pattern, num):
        return True

    return False
Enter fullscreen mode Exit fullscreen mode

Wow, that looks amazing already! Let's take a closer look at these 4 lines:

  • ^M{0,3} = Between 0 and 3 [M] at the beginning [^] of the string
  • (CM|CD|D?C{0,3})? = One pair [CM] or one pair [CD] or [D], followed by up to 3 [C]. Each element is optional [?], as well as the whole block [()?]
  • (XC|XL|L?X{0,3})? = One pair [XC] or one pair [XL] or [L], followed by up to 3 [X]. Each element is optional [?], as well as the whole block [()?]
  • (IX|IV|V?I{0,3})?$ = One pair [IX] or one pair [IV] or [V], followed by up to 3 [I]. Each element is optional [?], as well as the whole block [()?], which should be at the end of the string [$]

Let's test our code

I'm using a simple fstring calling my function and comparing the string against our pattern to validate the numeral or not:


num_valid = 'MMDCCLXXIII'
num_invalid = 'CCCMMVIIVV'

print(f"{num_valid} is {'not' if not is_roman_number(num_valid) else ''}a roman number")
print(f"{num_invalid} is {'not ' if not is_roman_number(num_invalid) else ''}a roman number")

# Output:
# MMDCCLXXIII is a roman number
# CCCMMVIIVV is not a roman number
Enter fullscreen mode Exit fullscreen mode

That wasn't so bad after all! Now look at this and tell me it's not the most beautiful thing you've seen in your life:

^M{0,3}(CM|CD|D?C{0,3})?(XC|XL|L?X{0,3})?(IX|IV|V?I{0,3})?$'
Enter fullscreen mode Exit fullscreen mode

Alt Text

That's all folks! Let me know if you are interested by the second part of the challenge: converting roman numerals from/to Arabic numbers and I will share my solution.

Stay safe out there and read you soon :)

Discussion (3)

pic
Editor guide
Collapse
basman profile image
Adam.S

I would certainly like to see the second part. I actually have a small use case. Think Star Wars.

Collapse
mihhdon profile image
Mihh Don

Great post, well explained!

Too bad that in year 4000 we won't have the Roman numbers anymore :(

Collapse
mrxreborn profile image
mrxreborn

Seems like, I'm going to have a nightmare 😰