Alexandre Donciu-Julin

Posted on Apr 9, 2021

A python regex to validate roman numerals

#python #regex #beginners

Note: This is my first post, I hope you'll like it :)

I'm not gonna lie to you... I LOVE REGEX!

As a kid, I grew up playing adventure games full of puzzles and riddles. Looking for the solution was a personal quest, a treasure hunt. Finding it was so exciting, but not as much as jumping on another riddle!

When I discovered regular expressions on my journey to become a (good) python programmer, I felt that same excitement. I was blown away by the countless possibilities these were offering. Deciphering one was like suddenly being able to read hieroglyph, writing one was like discovering I could speak a foreign language. Although I know they should be used with caution and in special cases only, I keep pushing myself to use them everywhere I can.

That's why, when Codewars challenged me to write a function to convert roman numerals from/to Arabic numbers, I could not resist writing a regex to help me solve that problem.

Enough chit-chat, let's get our hands dirty.

My first task was to validate if the user input was a valid roman numeral. To sum up, roman numerals consists of the following symbols:Source: Wikipedia

It seems that the thousands unit [M] does not extend past [MMM], which means that the biggest roman number would be [MMMCMXCIX], or 3999. I'm not sure if numbers could go higher than that and why the limit, anyway for the sake of this problem I limited myself with numbers between 1 and 3999.

Now the trick is that the symbols placement is very important. If you don't put them in the right order, the resulting number would be invalid and unreadable. As listed in the table up there, numerals should start with thousands [M] 1000, followed by hundreds [D/C] 500/100, then dozens [L/X] 50/10, and finally units [V/1] 5/1.

BUT, that's not it! Numerals can only repeat 3 times like [CCC] 300 before switching to a combo of two numerals like [CD] 400. So you can still have a [C] 100 before an [M] 1000, like in [CM] 900 for instance.

Bit confusing, isn't it?

Alright, let's recap our conditions:

Roman numbers are ranging from [I] 1 to [MMMCMXCIX] 3999
Numerals should follow a precise order: [M] 1000 / [D] 500 / [C] 100 / [L] 50 / [X] 10 / [V] 5 / [I] 1
A numeral cannot repeat more than 3 times, it then uses a pair
The following pairs are allowed: [CM] 900 / [CD] 400 / [XC] 90 / [XL] 40 / [IX] 9 / [IV] 4

Do you start to see our REGEX showing up? :)

Let's translate this into code.

For this we are going to use a tag I find really helpful when writing regex is the verbose one (re.VERBOSE or re.X) It allows you to spread your pattern on multiple lines and be more readable. Let's try it!



import re

def is_roman_number(num):

    pattern = re.compile(r"""   
                                ^M{0,3}
                                (CM|CD|D?C{0,3})?
                                (XC|XL|L?X{0,3})?
                                (IX|IV|V?I{0,3})?$
            """, re.VERBOSE)

    if re.match(pattern, num):
        return True

    return False

Wow, that looks amazing already! Let's take a closer look at these 4 lines:

^M{0,3} = Between 0 and 3 [M] at the beginning [^] of the string
(CM|CD|D?C{0,3})? = One pair [CM] or one pair [CD] or [D], followed by up to 3 [C]. Each element is optional [?], as well as the whole block [()?]
(XC|XL|L?X{0,3})? = One pair [XC] or one pair [XL] or [L], followed by up to 3 [X]. Each element is optional [?], as well as the whole block [()?]
(IX|IV|V?I{0,3})?$ = One pair [IX] or one pair [IV] or [V], followed by up to 3 [I]. Each element is optional [?], as well as the whole block [()?], which should be at the end of the string [$]

Let's test our code

I'm using a simple fstring calling my function and comparing the string against our pattern to validate the numeral or not:




num_valid = 'MMDCCLXXIII'
num_invalid = 'CCCMMVIIVV'

print(f"{num_valid} is {'not' if not is_roman_number(num_valid) else ''}a roman number")
print(f"{num_invalid} is {'not ' if not is_roman_number(num_invalid) else ''}a roman number")

# Output:
# MMDCCLXXIII is a roman number
# CCCMMVIIVV is not a roman number

That wasn't so bad after all! Now look at this and tell me it's not the most beautiful thing you've seen in your life:



^M{0,3}(CM|CD|D?C{0,3})?(XC|XL|L?X{0,3})?(IX|IV|V?I{0,3})?$'

That's all folks! Let me know if you are interested by the second part of the challenge: converting roman numerals from/to Arabic numbers and I will share my solution.

Stay safe out there and read you soon :)

Top comments (6)

Jonatan • Jun 21 '21

Amazing! Only one comment: an empty string will validate to True. Thanks for the article!

Alexandre Donciu-Julin • Jun 23 '21

Hey Jonatan. Thanks for reading my post :)

Sorry but empty strings in Python 2 and 3 are considered False in a boolean context. More infos: stackoverflow.com/questions/957324...

I've been using this property for years, for instance to execute a function only if a path is define:

path = "C:\test"
if path:
    copy_files()
else:
    print("Path not defined")

Cheers

Adam.S • Apr 10 '21

I would certainly like to see the second part. I actually have a small use case. Think Star Wars.

Jessica Alves • Sep 21 '22

Very good explanation, thanks! I came through your article while searching a solution for the same problem on Hackerrank.

Mihh Don • Apr 10 '21

Great post, well explained!

Too bad that in year 4000 we won't have the Roman numbers anymore :(

mrxreborn • Apr 10 '21

Seems like, I'm going to have a nightmare 😰

DEV Community

A python regex to validate roman numerals

I'm not gonna lie to you... I LOVE REGEX!

Enough chit-chat, let's get our hands dirty.

Alright, let's recap our conditions:

Let's translate this into code.

Let's test our code

Top comments (6)

Read next

Building a 🧾 PDF Invoice Generator with 📨 Twilio SendGrid, 🐍 Python, Flask, and 🖌️ Bootstrap

Python Error Types Explained: Troubleshooting for Beginners

14 JavaScript Console Methods for Effective Debugging

Reactive vs. Ref in Vue 3: What’s the difference?