This is probably the most Computer Science heavy episode so far. If that's not your thing, feel free to skip and come back for the next episode.
Deterministic Finite Automaton is a computer science concept. It's a program with these properties:
- at every point it's in one of limited number of states
- it goes through the string one character at a time
- based on current state, current character, and nothing else, it choses the next state
- once it's done with the string, we extract some information from whichever state it ended up in
DFA Example
So for example let's write a program that matches a string of digits. It can have some extra spaces at the beginning and end, and it can have _
between digits (but not elsewhere, an not multiples), and it's not allowed to have leading 0s unless the whole number is zero. In regular expression terms it's /^\s*(0|[1-9]\d*(_\d+)*)\s*$/
.
Let's try to make a DFA for it:
- state
Start
, if characterspace
, go to stateStart
: - state
Start
, if character1-9
, go to stateDigit
: - state
Start
, if character0
, go to stateEnd
: - state
Digit
, if character0-9
, go to stateDigit
- state
Digit
, if characterspace
, go to stateEnd
- state
Digit
, if character_
, go to stateUnderscore
- state
Undersor
, if character0-9
, go to stateDigit
- state
End
, if characterspace
go to stateEnd
Any state, any character not listed, go to state Fail
.
If we reached end of the string, and state is either Digit
or End
, the string matches. Otherwise it doesn't. Hopefully I didn't mess it up.
Other Finite Automata
DFAs are nice because it's obvious how to make them super fast - it's just one table lookup per character. It's also possible to combine them - OR, AND, and NOT of any number of DFAs is still a DFA (even if potentially a much bigger one).
This is all nice, but we usually want to know more than "matches" or "doesn't match" is. So we came up with so damn many variants of the DFA idea - including the engine behind regular expressions.
What we want for Thue is something like that, except:
- we want to know which rule matched
- we want to know where it matched
- we want to know exactly one of all possible matches
So I came up with Randomized Finite Automaton - a DFA variant that's perfect for Thue.
Trie
First, let's organize all the rules into a "trie". Trie is like a tree-like structure which lets us lookup a lot of strings at once. At root of the tree is a hash with keys being all the first characters. Then at every node there's some data for strings that finish there, and more tries for strings that continue.
For example a trie for this collection of strings and heir associated data: {"no": 1, "no": 2, "not": 3, "nu": 4}
would be:
-
root
hasdata: []
, and children{'n': trie_n}
-
trie_n
hasdata: []
, and children{'o': trie_no, 'u': trie_nu}
-
trie_nu
hasdata: [4]
, and children{}
-
trie_no
hasdata: [1, 2]
, and children{"t": trie_not}
-
trie_not
hasdata: [3]
, and children{}
The goal of this is that we can have very big number of rules, and we match them all at once, instead of trying every single rule. If we have thousands of rules, this can be a lot faster than hash table based solution, since trie-based solution can just look at one character and instantly eliminate hundreds or thousands of potential matches, and only the relevant ones stay.
For example if we have this trie, and string "maybe", we do a check for root.children['m']
, root.children['a']
, root.children['y']
, root.children['b']
, root.children['e']
, get empty 5 times, and we're done. No matter how many rules starting with n
or whatnot we had.
I found one Trie implementation for Crystal but it wasn't doing quite what I wanted. I wanted multiple data per node, it had just one. It wouldn't be too hard to adapt, but tries are super simple so I just wrote my own implementation:
class Trie(T)
getter :data
def initialize
@data = [] of T
@children = Hash(Char, Trie(T)).new
end
def insert(str : String, data : T)
if str.empty?
@data.push(data)
else
c = str[0]
@children[c] ||= Trie(T).new
@children[c].insert(str[1..-1], data)
end
end
def [](c : Char)
@children[c]?
end
end
RFA
Now there are two ways to go forward. The first (NFA style), is to remember every partially matched trie. This can potentially mean checking N tries for every character if maximum length of the rule is N.
The other would be to precalculate every combination (DFA style). In principle that would be faster as we guarantee just one lookup per character. The cost would be extra calculation, and potentially a lot bigger tries.
If we expect rules to be fairly short (let's say 10 characters or less), even if there are thousands of rules in our Thue program, then the NFA style solution is just better. If we expect rules to be very long, then DFA solution would win, but I don't think Thue programs would have very big rules.
NFA solution would also be better at ignoring fake rules - like if you use impossible rules as comments (# this is a comment ::= it will never be matched
), NFA solution is pretty much unaffected, while DFA solution would have significantly bigger state.
So here's the Randomized Finite Automaton - the core of this episode:
class RFA
def initialize(@rules : Array(ThueRule))
@trie = Trie(ThueRule).new
@rules.each do |rule|
@trie.insert(rule.left, rule)
end
end
# No empty matches allowed
def random_match(str : String)
count = 0
active = [@trie]
match = nil
str.chars.each_with_index do |char, idx|
next_tries = active.map{|t| t[char]}.compact
matching_rules = next_tries.flat_map(&.data)
unless matching_rules.empty?
count += matching_rules.size
if rand(count) < matching_rules.size
rule = matching_rules.sample
match = {rule: rule, idx: (idx - rule.left.size + 1)}
end
end
active = [@trie, *next_tries]
end
match
end
end
In the constructor we just insert every rule into the main trie.
As we match, we go character by character, and remember every potential trie. That's the main trie, plus any trie which we started matching already. The number of that is bound by length of the longest rule, but in principle there would be very few tries at the same time. (DFA-style solution would have only 1 trie, basically result of merging those NFA tries).
Then we go through all the tries and get all the rules matching at current character.
Now here's the fun part - we could use it to generate list of all possible matches in the string, but that's not what we want, we just want one. So we know we had N matches so far, and M matches at current character. We pick one of M at random, then we roll M / (N + M) to decide if we want to keep new or old match.
The final thing we need to adjust is subtract number of characters minus one from the match. The RFA gives us address of last matching character, but it's generally more convenient to know the first. All rules have fixed number of characters, so it's very easy.
Complete Thue Interpreter
Here's the whole program:
#!/usr/bin/env crystal
require "./trie"
class ThueRule
getter :left
def initialize(@left : String, @right : String)
@right = "~\n" if @right == "~"
end
def apply(str, idx)
before = str[0, idx]
after = str[idx+@left.size .. -1]
if @right[0] == '~'
print @right[1..-1]
replacement = ""
elsif @right == ":::"
replacement = STDIN.gets.not_nil!.chomp
else
replacement = @right
end
before + replacement + after
end
def to_s(io)
io << "Rule<#{@left.inspect}::=#{@right.inspect}>"
end
end
class ThueSideParser
getter :results
@toparse : Array(Char)
def initialize(@str : String)
@results = [""]
@toparse = @str.chars
parse
end
private def parse
until @toparse.empty?
case @toparse[0]
when '['
chars = parse_range
if @results.size == 1
@results = chars.map{|c| @results[0]+c}
elsif @results.size == chars.size
@results = @results.zip(chars).map{|s,c| s+c}
else
raise "Sizes of character classes mismatch in #{@str}"
end
else
c = parse_character
@results = @results.map{|s| s + c}
end
end
@results
end
private def parse_character
if @toparse[0] == '\\'
@toparse.shift
raise "Unmatched \\ in #{@str}" if eos?
c = @toparse.shift
case c
when 'n'
'\n'
when 's'
' '
else
c
end
else
@toparse.shift
end
end
private def parse_range
chars = [] of Char
@toparse.shift
loop do
raise "Character range never closed in #{@str}" if eos?
if @toparse[0] == ']'
@toparse.shift
return chars
end
c = parse_character
raise "Character range never closed in #{@str}" if eos?
if @toparse[0] == '-'
@toparse.shift
e = parse_character
raise "Invalid character range in #{@str}" if e < c
chars.concat(c..e)
else
chars << c
end
end
end
private def eos?
@toparse.empty?
end
end
class ThueRuleParser
def initialize(@str : String)
if @str =~ /\A(.*)::=(.*)\z/
@valid = true
@left = $1
@right = $2
else
@left = ""
@right = ""
@valid = false
end
end
def valid_rule?
@valid
end
def empty_rule?
@valid && @left.empty?
end
def call
lefts = ThueSideParser.new(@left).results
rights = ThueSideParser.new(@right).results
# Support N-to-1 and 1-to-N rules
lefts *= rights.size if lefts.size == 1
rights *= lefts.size if rights.size == 1
unless lefts.size == rights.size
raise "Mismatched side of rule #{@str}"
end
lefts.zip(rights).map do |left, right|
ThueRule.new(left, right)
end
end
end
class RFA
def initialize(@rules : Array(ThueRule))
@trie = Trie(ThueRule).new
@rules.each do |rule|
@trie.insert(rule.left, rule)
end
end
# No empty matches allowed
def random_match(str : String)
count = 0
active = [@trie]
match = nil
str.chars.each_with_index do |char, idx|
next_tries = active.map{|t| t[char]}.compact
matching_rules = next_tries.flat_map(&.data)
unless matching_rules.empty?
count += matching_rules.size
if rand(count) < matching_rules.size
rule = matching_rules.sample
match = {rule: rule, idx: (idx - rule.left.size + 1)}
end
end
active = [@trie, *next_tries]
end
match
end
end
class ThueProgram
def initialize
@rules = [] of ThueRule
@initial = ""
@state = ""
end
def load(path)
lines = File.read_lines(path).map(&.chomp).zip(1..)
while lines.size > 0
line, line_no = lines.shift
# Ignoring invalid rules, they are sometimes used as comments
parser = ThueRuleParser.new(line)
next unless parser.valid_rule?
break if parser.empty_rule?
@rules.concat parser.call
end
@rfa = RFA.new(@rules)
@initial = lines.map(&.first).join("\n")
end
def run(debug)
@state = @initial
if debug
@rules.each do |rule|
STDERR.puts rule
end
end
while match = @rfa.not_nil!.random_match(@state)
rule = match[:rule]
idx = match[:idx]
if debug
STDERR.puts "Applying rule #{rule} at #{idx} to #{@state.inspect}"
end
@state = rule.apply(@state, idx)
end
if debug
STDERR.puts "No more matches. Final state: #{@state.inspect}"
end
end
end
unless ARGV.size == 1
STDERR.puts "Usage: #{PROGRAM_NAME} <file.thue>"
exit 1
end
prog = ThueProgram.new
prog.load(ARGV[0])
# Crystal doesn't handle SIGPIPE well and we want to support:
# crystal thue.cr examples/fizzbuzz.thue | head -n 100
begin
prog.run(!!ENV["DEBUG"]?)
rescue e : IO::Error
exit if e.os_error == Errno::EPIPE
raise e
end
Performance
Doing just this change, we got decent performance improvement, 51s to 21s on 100k FizzBuzz Thue program:
$ time ./thue_rx.cr examples_rx/fizzbuzz.thue | head -n 100000 >/dev/null
./thue_rx.cr examples_rx/fizzbuzz.thue 51.50s user 0.81s system 101% cpu 51.601 total
head -n 100000 > /dev/null 0.16s user 0.39s system 1% cpu 51.590 total
$ time ./thue_rfa.cr examples_rx/fizzbuzz.thue | head -n 100000 >/dev/null
./thue_rfa.cr examples_rx/fizzbuzz.thue 21.47s user 13.90s system 165% cpu 21.418 total
head -n 100000 > /dev/null 0.11s user 0.21s system 1% cpu 21.408 total
Comparing release builds, the difference is consistent but small, 41s to 39s on 500k FizzBuzz Thue program. Both finish 100k FizzBuzz in ~7s:
$ crystal build --release thue_rfa.cr
$ crystal build --release thue_rx.cr
$ time ./thue_rx examples_rx/fizzbuzz.thue | head -n 500000 >/dev/null
./thue_rx examples_rx/fizzbuzz.thue 41.05s user 3.38s system 106% cpu 41.762 total
head -n 500000 > /dev/null 0.45s user 0.88s system 3% cpu 41.760 total
$ time ./thue_rfa examples_rx/fizzbuzz.thue | head -n 500000 >/dev/null
./thue_rfa examples_rx/fizzbuzz.thue 39.44s user 64.53s system 272% cpu 38.119 total
head -n 500000 > /dev/null 0.52s user 0.95s system 3% cpu 38.117 total
I'm not sure if this counts as a win or not. It's very big improvement on the development build, but small one on the release build. It's definitely going to be more significant when running Thue programs with a huge number of rules, I guess FizzBuzz with less than 100 rules didn't really benefit from that.
There's probably a lot of small optimizations that can be applied to RFA#random_match
, even without precomputing a single big trie.
Code
All code examples for the series will be in this repository.
Code for the Better Thue Interpreter in Crystal episode is available here.
Top comments (2)
Really nice post, as usual!
Some small things to further improve the performance:
str.each_char_with_index
instead ofstr.chars.each_with_index
(the former doesn't allocate memory while the latter does)active.compact_map { |t| t[char] }
instead ofactive.map { |t| t[char] }.compact
(one less intermediate array)active.clear; active.push @trie; active.concat next_tries
instead of creating a new array for each charThat said, I totally understand that the goal is to improve performance compared to before while also keeping the code as readable as possible. I'm just suggesting these because you said "There's probably a lot small optimizations" and because I like optimizing things :-)
Oh for sure. I think the biggest performance savings would come from switching from processing characters to processing bytes, as in UTF-8 this works perfectly well without any changes, and then we could just use 256 entry arrays instead of hashes for the trie.
Not like this really makes much difference for programs with so few rules.