loading...

Exercise: Making a Simple Regex Engine

aminarria profile image Amin Arria ・4 min read

I've started working for Lambdaclass and getting to know Erlang. As preparation for interviews with potential clients I got an assignment based on this article: Build a Regex Engine in Less than 40 Lines of Code. So now I need to try and do it in Erlang, let's see how it goes.

The Task

Create a function (match/2) that will receive a string and a regex (also a string), and return true if it matches or false otherwise. You won't be using all of Regex, just a small subset of it as follows:

Syntax Meaning Example Matches
a match the specified character k k
. matches any character (except newline) . a,b,c,d...
? matches 0 or 1 of the previous character aq? a, aq
* matches 0 or more of the previous character b* "", b, bb, bbb
^ matches the start of a string ^ca ca
$ matches the end of a string eb$ eb

Considerations

Before diving into making a solution I needed to establish certain considerations that I took into account when designing the solution.

  • The given regex won't be invalid (Eg: "^*", "*?")
  • I can ignore the existance of the newline character since I expect simple strings as inputs. Thus the character . matches anything
  • An empty regex matches nothing so match(SomeString, EmptyRegex) will return false
  • match/2 can return true as soon it matches something since I don't care about where or what.

Solving the Problem

First step to making a solution was to create the base cases. An empty string or and an empty regex can't match anything.

-module(regex).

-export([match/2]).

match([], _Regex) ->
  false;
match(_String, []) ->
  false;

Next step was to match same characters, something simple thanks to pattern matching:

match([Char | String], [Char | Regex]) ->
  matching(String, Regex);
match([_Char | _String], [_OtherChar | _Regex]) ->
  false.

Doing this I figured that I already had made a mistake, my match/2 function would match correctly things like match("hi", "hi"), but would fail matching match("oh hi", "hi"). I needed a way to try and match the string with the regex trying every single character in the string as a starting point. For this I modified match/2 to call a private matching/2 function. If it succeeded match/2 would return true, but if it failed I will try it again but dropping the first character of the string.

match(String, Regex) ->
  case matching(String, Regex) of
    true ->
      true;
    false ->
      match(tl(String), Regex)
  end.

matching/2 will work by consuming the regex and string. Whenever there is no more regex a match has been found. I'll add this clauses and modify the existing match of characters.

matching(_String, []) ->
  true;
matching([], _Regex) ->
  false;

matching([Char | String], [Char | Regex]) ->
  matching(String, Regex);
matching([_Char | _String], [_OtherChar | _Regex]) ->
  false.

For the . I just had to had a clause to match any character in the string against it and keep on matching.

matching([_Char | String], [%. | Regex]) ->
  matching(String, Regex);

Next step was to tackle the remaining special characters. I supposed the ^ and $ would be easier, so them first. For the ^ I only needed to add a clause to match/2 to check for it and return whatever matching/2 returns, can't keep trying. For the $ I would only need to check if the string is empty and return true or if not empty return false.

match(String, [$^ | Regex]) ->
  matching(String, Regex);

...

matching([], [$$]) ->
  true;
matching(_String, [$$]) ->
  false;

Now starts the hard part, handling ? and *. As always I think going from simple to dificult is best so let's start with ?. I need to consume both a string's and a regex's character when they match, but when they don't only consume the regex's character.

matching([Char | String], [Char | [$? | Regex]]) ->
  matching(String, Regex);
matching(String, [_OtherChar | [$? | Regex]]) ->
  matching(String, Regex);

Not the most beautiful pattern matching, but it works. The thing is this leaves us with an issue: Now we can have no string left and still match successfully (Eg: match("hi", "hio?"). To fix this I'll add a clause checking if ? is the next character and continue matching if that's the case.

matching([], [_Char | [$? | Regex]]) ->
  matching([], Regex);
matching([], _Regex) ->
  false;

For * its pretty much the same except when it matches we don't go ahead an consume it, instead we keep calling it until it doesn't match. Like before I'll also run into the issue of not having any more string so I need to add a clause for that as well.

One last problem: .*. This combination will match against every character and prevent matches like match("hello", ".*o"). This was really hard to figure a good way to deal with. Eventually I realized that since .* would match against nothing or everything I could just do like match/2 and use the same process of trying and consuming the string.

matching(String, [$. | [$* | RemRegex]] = Regex) ->
  case matching(String, RemRegex) of
    true ->
      true;
    false ->
      matching(tl(String), Regex)
  end;

The End

Finally after quite a struggle (and more than 40 lines of code) I managed to finish a simple regex engine. It was really interesting thinking about edge cases, how to best pattern match, and how to order the function clauses. I leave a snippet of the whole final module below.

PS: If you are interested in learning Erlang with some exercises check out Erlings

-module(regex).

-export([match/2]).

match([], _Regex) ->
  false;
match(_String, []) ->
  false;
match(String, [$^ | Regex]) ->
  matching(String, Regex);
match(String, Regex) ->
  case matching(String, Regex) of
    true ->
      true;
    false ->
      match(tl(String), Regex)
  end.

matching(_String, []) ->
  true;

matching([], [$$]) ->
  true;
matching([], [_Char | [$? | Regex]]) ->
  matching([], Regex);
matching([], [_Char | [$* | Regex]]) ->
  matching([], Regex);
matching([], _Regex) ->
  false;

matching(_String, [$$]) ->
  false;

matching(String, [$. | [$* | RemRegex]] = Regex) ->
  case matching(String, RemRegex) of
    true ->
      true;
    false ->
      matching(tl(String), Regex)
  end;

matching([Char | String], [Char | [$? | Regex]]) ->
  matching(String, Regex);
matching(String, [_OtherChar | [$? | Regex]]) ->
  matching(String, Regex);

matching([Char | String], [Char | [$* | _RemRegex]] = Regex) ->
  matching(String, Regex);
matching(String, [_OtherChar | [$* | Regex]]) ->
  matching(String, Regex);

matching([_Char | String], [$. | Regex]) ->
  matching(String, Regex);
matching([Char | String], [Char | Regex]) ->
  matching(String, Regex);
matching([_Char | _String], [_OtherChar | _Regex]) ->
  false.

Posted on by:

Discussion

pic
Editor guide
 

<3 for using Erlang. Although it may look strange people using C-Style syntax or other scripting languages, Erlang and the Erlang VM (BEAM) has a ton of great features and concepts that gets slowly added to other more popular languages.

I also love RegExs so that was a great read for me :)

You might mention, that most actual productive implementation do compile a regular expression on runtime to executable code for performance reasons.

 

Hey, nice job.

Actually, you can fit it in less than 40 lines just reordering some of the clauses and using the fact that:

  case matching(Something) of
    true ->
      true;
    false ->
      matching(SomethingElse) 
  end

is equivalent to:

   matching(Something) orelse matching(SomethingElse)

since matching/2 returns a boolean.

Here's my (I believe equivalent) version:
(I used search and match names since they are similar to how python re module works, where search looks anywhere in the string and match has to match from the beginning of the string)

-module(regex).

-export([search/2, match/2]).

search(String, [$^ | Regex]) ->
  match(String, Regex);
search([Char | String], Regex = [_ | _]) ->
  search(String, Regex) orelse match([Char | String], Regex);
search(_String, _Regex) ->
  false.

match(_String, []) ->
  true;

match([], [$$]) ->
  true;

match(String, [Char | [$? | Regex]]) ->
  match(String, Regex) orelse match(String, [Char | Regex]);

match([_Char | String], [$. | [$* | Regex]]) ->
  match(String, Regex) orelse match(String, [$. | [$* | Regex]]);
match([Char | String], [Char | [$* | Regex]]) ->
  match(String, Regex) orelse match(String, [Char | [$* | Regex]]);
match(String, [_Char | [$* | Regex]]) ->
  match(String, Regex);

match([_Char | String], [$. | Regex]) ->
  match(String, Regex);

match([Char | String], [Char | Regex]) ->
  match(String, Regex);

match(String, [$^ | Regex]) ->
  match(String, Regex);

match(_String, _Regex) ->
  false.

Haven't tested a lot of cases, but seems to be working. :)

 

Nice catch!