Tomasz Wegrzanowski

Posted on Nov 27, 2021

100 Languages Speedrun: Episode 07: Sed and Regular Expression FizzBuzz

#unix #sed #regex

Time for some regular expressions! Just like with the CSS episode, we won't use them for intended purpose, we'll use them for programming!

Sed

Technically I'll count this as a Sed episode.

Sed is a very simple limited-purpose programming language which supports regular expressions and some really few other things. A sed script, usually a shell one-liner, is run separately on every line of input. It's still used in many shell scripts where one command produces some text, then sed reshuffles it a bit with regular expressions, then another command consumes it.

I don't really approve of this style of programming, and I think you're better off using proper script in Ruby, Python, or Perl pretty much every single scenario - it will be much clearer, actually testable, and very often more concise as well.

And even if you want to just reshuffle some text with regular expressions, Ruby or Perl are better at shell one-liners than Sed, Awk, and such. Perl even comes with sed2perl tool preinstalled (s2p executable), which can convert a sed script to Perl, but its output is quite verbose to match sed quirks accurately.

In any case, we'll use very few Sed features, this episode is all about regular expressions and how to code a FizzBuzz with them. We won't be using any advanced regular expression features, so it will all work in every language with regular expressions.

Hello, world!

Let's start with a hello, world program (-E basically means "don't be broken" and is required for every sed command, -f means to use a file):

#!/usr/bin/sed -Ef

s/.*/Hello, world!/

It will turn every line of input into a "Hello, world!" line:

$ seq 1 5 | ./hello.sed
Hello, world!
Hello, world!
Hello, world!
Hello, world!
Hello, world!

seq A B is a command line utility which generates a sequence of numbers from A to B, we'll be using this a lot.

As an aside, here's equivalent Perl script (-p means to run it for every line, and print the result for every line):

#!/usr/bin/perl -p

s/.*/Hello, world!/

As you can see Perl supports very similar style of programming, and most Sed scripts can be very easily translated into Perl - or in this case just run identically. This isn't a coincidence, one of the goals of Perl was to replace a typical unmaintainable mess resulting from Sed, Awk, and Unix Shell's awkwardly limited control structures, with something that's more or less a real programming language, while remaining extremely concise to use for writing one-liners, and working nicely with Unix shell. We'll definitely get to Perl in some future episode.

Odd or Even

For FizzBuzz seq will give us the numbers, and our regular expressions just need to decide if we should keep it or replace it with Fizz, Buzz, or FizzBuzz.

This might seem like a bit of a challenge, as regular expressions only operate on strings, and very notable don't have any concepts like "divisible by 3" or "divisible by 5".

But we don't need to do any math, we just need to translate those questions about numbers into questions about decimal strings representing those numbers.

For divisibility by 2, the rules are very simple:

any number ending in 0, 2, 4, 6, 8 is divisible by 2 - even
any number ending in 1, 3, 5, 7, 9 is not divisible by 2 - odd

Let's translate this into a Sed script:

#!/usr/bin/sed -Ef

s/.*[02468]$/Even/
s/.*[13579]$/Odd/

And it definitely works:

$ seq 1 12 | ./odd_even.sed
Odd
Even
Odd
Even
Odd
Even
Odd
Even
Odd
Even
Odd
Even

Decimal FizzBuzz

Before we get into the actual FizzBuzz, let's do a simpler version. What if instead of checking divisibility by 3 and 5, it was checking divisibility by 2 and 5?

These are much easier as both 2 and 5 only need checking the final digit:

#!/usr/bin/sed -Ef

s/.*0$/FizzBuzz/
s/.*[2468]$/Fizz/
s/.*5$/Buzz/

And we can verify that it works:

$ seq 1 20 | ./decimal_fizzbuzz.sed
1
Fizz
3
Fizz
Buzz
Fizz
7
Fizz
9
FizzBuzz
11
Fizz
13
Fizz
Buzz
Fizz
17
Fizz
19
FizzBuzz

Divisibility by 3

Divisibility by 3 has a simple rule that can be expressed by operations on digits: "add all digits, if they are divisible by 3, the original number is divisible by 3".

Let's implement this in three steps:

replace every digit 3-9 by 0, 1, or 2 for simplicity
a lot of times (with copy-pasted code, as we have no loops) replace pairs of digits by their sum modulo 3
in the end we'll have one digit, let's replace it by the result

#!/usr/bin/sed -Ef

# Replace numbers by their remainder by 3
s/[369]/0/g
s/[47]/1/g
s/[58]/2/g

# Add pairs of digits, taking only remainder by 3
# Each line reduces number of digits by at least half
# so every number up to 1024 digits works
s/00|12|21/0/g; s/01|10|22/1/g; s/02|20|11/2/g
s/00|12|21/0/g; s/01|10|22/1/g; s/02|20|11/2/g
s/00|12|21/0/g; s/01|10|22/1/g; s/02|20|11/2/g
s/00|12|21/0/g; s/01|10|22/1/g; s/02|20|11/2/g
s/00|12|21/0/g; s/01|10|22/1/g; s/02|20|11/2/g
s/00|12|21/0/g; s/01|10|22/1/g; s/02|20|11/2/g
s/00|12|21/0/g; s/01|10|22/1/g; s/02|20|11/2/g
s/00|12|21/0/g; s/01|10|22/1/g; s/02|20|11/2/g
s/00|12|21/0/g; s/01|10|22/1/g; s/02|20|11/2/g
s/00|12|21/0/g; s/01|10|22/1/g; s/02|20|11/2/g

# Do the final substitution
s/^0$/Divisible by 3/
s/^[12]$/Not divisible by 3/

This works:

$ seq 100000 100010 | ./divisible_by_three.sed
Not divisible by 3
Not divisible by 3
Divisible by 3
Not divisible by 3
Not divisible by 3
Divisible by 3
Not divisible by 3
Not divisible by 3
Divisible by 3
Not divisible by 3
Not divisible by 3

However, it's very inelegant. We'd much prefer to just have a single regular expression. Not to mention that it won't work for very big numbers.

Regular Languages

"Regular" expressions are called that due to their relation with a concept of "regular" languages. Disregarding all the fancy computer science concepts, a "regular" language is like a program with these constraints:

it has no randomness
it starts in some predefined starting state
it has finite number of states
it processes one character at a time, potentially changing to another state
at the end it has to say yes or no

It's easy to see that divisibility by 2 is a regular language:

states are 0 (even), or 1 (odd)
start in state 0
if you see 0, 2, 4, 6, or 8, move to state 0
if you see 1, 3, 5, 7, or 9, move to state 1
if you end up in state 0, answer is yes

But it also works for divisibility by 3:

states are 0 (divisible by three), 1 (remainder one), or 2 (remainder two)
start in state 0
if you see 0, 3, 6, 9, stay where you are
if you see 1, 4, 7, change states 0 to 1, 1 to 2, or 2 to 0
if you see 2, 6, 8, change states 0 to 2, 1 to 0, or 2 to 1
if you end up in state 0, answer is yes

Why does it matter? Every regular language can be translated to a single regular expression. Wikipedia has an algorithm, but I find that article completely unreadable.

Kleene's Algorithm

I think it's easier to show the algorithm that to tell. First we initialize transition table, then we do triple-nested loop over all states:

#!/usr/bin/env ruby

states = [0, 1, 2]
transitions = {
  [0,0] => "(0?)",
  [1,1] => "(0?)",
  [2,2] => "(0?)",
  [0,1] => "1",
  [1,2] => "1",
  [2,0] => "1",
  [0,2] => "2",
  [2,1] => "2",
  [1,0] => "2",
}

states.each do |k|
  new_transitions = {}
  states.each do |i|
    states.each do |j|
      a = transitions[[i,k]]
      b = transitions[[k,k]]
      c = transitions[[k,j]]
      d = transitions[[i,j]]
      new_transitions[[i,j]] = "(#{a}#{b}*#{c}|#{d})"
    end
  end
  transitions = new_transitions
end

pp transitions

Real algorithm would need a few extra checks to deal with nils for impossible transitions, I skipped that stuff as our transition table is fully connected, but that's just a few extra lines when setting new_transitions[[i,j]] (skip left side of |, or right side, or whole thing if there are any nils there).

Divisibility by 3, attempt two

We can now copy and paste regular expressions generated by previous program into our Sed script:

#!/usr/bin/sed -Ef

# Replace numbers by their reminder by 3, to simplify the regular expressions
s/[369]/0/g
s/[47]/1/g
s/[58]/2/g

# Do the final substitution
s/^((((0?)(0?)*1|1)(2(0?)*1|(0?))*(2(0?)*2|1)|((0?)(0?)*2|2))((1(0?)*1|2)(2(0?)*1|(0?))*(2(0?)*2|1)|(1(0?)*2|(0?)))*((1(0?)*1|2)(2(0?)*1|(0?))*(2(0?)*(0?)|2)|(1(0?)*(0?)|1))|(((0?)(0?)*1|1)(2(0?)*1|(0?))*(2(0?)*(0?)|2)|((0?)(0?)*(0?)|(0?))))$/Divisible by 3/
s/^((((0?)(0?)*1|1)(2(0?)*1|(0?))*(2(0?)*2|1)|((0?)(0?)*2|2))((1(0?)*1|2)(2(0?)*1|(0?))*(2(0?)*2|1)|(1(0?)*2|(0?)))*((1(0?)*1|2)(2(0?)*1|(0?))*(2(0?)*1|(0?))|(1(0?)*1|2))|(((0?)(0?)*1|1)(2(0?)*1|(0?))*(2(0?)*1|(0?))|((0?)(0?)*1|1)))$/Remainder 1/
s/^((((0?)(0?)*1|1)(2(0?)*1|(0?))*(2(0?)*2|1)|((0?)(0?)*2|2))((1(0?)*1|2)(2(0?)*1|(0?))*(2(0?)*2|1)|(1(0?)*2|(0?)))*((1(0?)*1|2)(2(0?)*1|(0?))*(2(0?)*2|1)|(1(0?)*2|(0?)))|(((0?)(0?)*1|1)(2(0?)*1|(0?))*(2(0?)*2|1)|((0?)(0?)*2|2)))$/Remainder 2/

And it works perfectly for any number:

$ seq 1000 1010 | ./divisible_by_three_2.sed
Remainder 1
Remainder 2
Divisible by 3
Remainder 1
Remainder 2
Divisible by 3
Remainder 1
Remainder 2
Divisible by 3
Remainder 1
Remainder 2

Well, there's two problems. First, this regular expression looks like a total abomination. And we still have two steps here. So let's modify our Kleene's algorithm to do some simplifications.

Regexp simplification

Kleene's algorithm generates a regular expression that works, but it can be extremely bloated. A fully featured regexp simplifier would be way beyond the scope of this post, but I can manually write down a few patterns we could try before printing the regexp.

#!/usr/bin/env ruby

def simplify(rx)
  rx
    .gsub("0?*", "0*")
    .gsub("0?0*", "0*")
    .gsub("0*0?", "0*")
    .gsub("(0*|0?)", "0*")
    .gsub("(20*|2)", "20*")
    .gsub("(10*|1)", "10*")
    .gsub("(20*1|0?)*", "(20*1|0)*")
    .gsub("(0*1|1)", "0*1")
    .gsub("(0*2|2)", "0*2")
end

states = [0, 1, 2]
transitions = {
  [0,0] => "0?",
  [1,1] => "0?",
  [2,2] => "0?",
  [0,1] => "1",
  [1,2] => "1",
  [2,0] => "1",
  [0,2] => "2",
  [2,1] => "2",
  [1,0] => "2",
}

states.each do |k|
  new_transitions = {}
  states.each do |i|
    states.each do |j|
      a = transitions[[i,k]]
      b = transitions[[k,k]]
      c = transitions[[k,j]]
      d = transitions[[i,j]]
      new_transitions[[i,j]] = simplify("(#{a}#{b}*#{c}|#{d})")
    end
  end
  transitions = new_transitions
end

pp transitions

Divisibility by 3, attempt three

With somewhat simplified regexps, the code looks a bit better, but there are still two steps:

#!/usr/bin/sed -Ef

# Replace numbers by their reminder by 3, to simplify the regular expressions
s/[369]/0/g
s/[47]/1/g
s/[58]/2/g

# Do the final substitution
s/^((0*1(20*1|0)*(20*2|1)|0*2)((10*1|2)(20*1|0)*(20*2|1)|(10*2|0?))*((10*1|2)(20*1|0)*20*|10*)|(0*1(20*1|0)*20*|0*))$/Divisible by 3/
s/^((0*1(20*1|0)*(20*2|1)|0*2)((10*1|2)(20*1|0)*(20*2|1)|(10*2|0?))*((10*1|2)(20*1|0)*(20*1|0?)|(10*1|2))|(0*1(20*1|0)*(20*1|0?)|0*1))$/Remainder 1/
s/^((0*1(20*1|0)*(20*2|1)|0*2)((10*1|2)(20*1|0)*(20*2|1)|(10*2|0?))*((10*1|2)(20*1|0)*(20*2|1)|(10*2|0?))|(0*1(20*1|0)*(20*2|1)|0*2))$/Remainder 2/

Divisible by 3, attempt four

There's no mystery what to do next. We just need to replace every reference to 0, 1, and 2, by respectively [0369], [147], and [258].

#!/usr/bin/sed -Ef

s/^(([0369]*[147]([258][0369]*[147]|[0369])*([258][0369]*[258]|[147])|[0369]*[258])(([147][0369]*[147]|[258])([258][0369]*[147]|[0369])*([258][0369]*[258]|[147])|([147][0369]*[258]|[0369]?))*(([147][0369]*[147]|[258])([258][0369]*[147]|[0369])*[258][0369]*|[147][0369]*)|([0369]*[147]([258][0369]*[147]|[0369])*[258][0369]*|[0369]*))$/Divisible by 3/
s/^(([0369]*[147]([258][0369]*[147]|[0369])*([258][0369]*[258]|[147])|[0369]*[258])(([147][0369]*[147]|[258])([258][0369]*[147]|[0369])*([258][0369]*[258]|[147])|([147][0369]*[258]|[0369]?))*(([147][0369]*[147]|[258])([258][0369]*[147]|[0369])*([258][0369]*[147]|[0369]?)|([147][0369]*[147]|[258]))|([0369]*[147]([258][0369]*[147]|[0369])*([258][0369]*[147]|[0369]?)|[0369]*[147]))$/Remainder 1/
s/^(([0369]*[147]([258][0369]*[147]|[0369])*([258][0369]*[258]|[147])|[0369]*[258])(([147][0369]*[147]|[258])([258][0369]*[147]|[0369])*([258][0369]*[258]|[147])|([147][0369]*[258]|[0369]?))*(([147][0369]*[147]|[258])([258][0369]*[147]|[0369])*([258][0369]*[258]|[147])|([147][0369]*[258]|[0369]?))|([0369]*[147]([258][0369]*[147]|[0369])*([258][0369]*[258]|[147])|[0369]*[258]))$/Remainder 2/

FizzBuzz

And finally we're reaching our original goal. We could run Kleene's algorithm on a 15 state automaton, but we can do something simpler:

if last digit is 0, and everything before is divisible by 3, then FizzBuzz
if last digit is 5, and everything before has remainder 1 modulo 3, then FizzBuzz
if last digit is 0 or 5, in every other case, Buzz
if everything is divisible by 3, then Fizz
else don't substitute

#!/usr/bin/sed -Ef

s/^(([0369]*[147]([258][0369]*[147]|[0369])*([258][0369]*[258]|[147])|[0369]*[258])(([147][0369]*[147]|[258])([258][0369]*[147]|[0369])*([258][0369]*[258]|[147])|([147][0369]*[258]|[0369]?))*(([147][0369]*[147]|[258])([258][0369]*[147]|[0369])*[258][0369]*|[147][0369]*)|([0369]*[147]([258][0369]*[147]|[0369])*[258][0369]*|[0369]*))0$/FizzBuzz/
s/^(([0369]*[147]([258][0369]*[147]|[0369])*([258][0369]*[258]|[147])|[0369]*[258])(([147][0369]*[147]|[258])([258][0369]*[147]|[0369])*([258][0369]*[258]|[147])|([147][0369]*[258]|[0369]?))*(([147][0369]*[147]|[258])([258][0369]*[147]|[0369])*([258][0369]*[147]|[0369]?)|([147][0369]*[147]|[258]))|([0369]*[147]([258][0369]*[147]|[0369])*([258][0369]*[147]|[0369]?)|[0369]*[147]))5$/FizzBuzz/
s/^.*[05]$/Buzz/
s/^(([0369]*[147]([258][0369]*[147]|[0369])*([258][0369]*[258]|[147])|[0369]*[258])(([147][0369]*[147]|[258])([258][0369]*[147]|[0369])*([258][0369]*[258]|[147])|([147][0369]*[258]|[0369]?))*(([147][0369]*[147]|[258])([258][0369]*[147]|[0369])*[258][0369]*|[147][0369]*)|([0369]*[147]([258][0369]*[147]|[0369])*[258][0369]*|[0369]*))$/Fizz/

And you can move the first two rules into a s/^(A|B)$/FizzBuzz/ if you want to reduce the number of rules.

How can you learn regular expressions?

I highly recommend Regex Crossword. From what I've seen it takes only about 1-2 hours to go from having no idea about regular expressions to having pretty decent understanding of their how they work, and this is going to repay really quickly.