DEV Community

Cover image for Creating an efficient SpellChecker in Elixir
Byron Salty
Byron Salty

Posted on • Edited on

Creating an efficient SpellChecker in Elixir

Goal

To efficiently use a large list of known English words to create a spell checker.

NOTE This is a work in progress. Please respond with comments on how it could be done better / if there are existing frameworks or libraries that I should use instead.


Plan

In this article I'll create two versions of a simple spell checker which really will just take a word and see if it's in a large list of known words. I'm using this list of 20,000 words, which is supposed to be a list of the most frequent ones.

I plan to use this in a Phoenix app so I wanted to do some measurements and determine the performance characteristics. My first thought was to use a GenServer so that I could have a single process load up the list and answer check requests, but I wanted to make sure overhead of a GenServer is worth it.


Code Structure

I'm creating two different SpellCheckers - a Naive one and a GenServer one.

I'm going to use a page from Bram Stoker's Dracula as test data.

And then I'll use Benchee to measure results.


Implementations

Helper:

defmodule FileReader do
  def read_words_from_file(file_path) do
    file_path
    |> File.stream!()
    |> Enum.map(&String.trim/1)
    |> Enum.reject(&(&1 == ""))
  end
end
Enter fullscreen mode Exit fullscreen mode

Naive:

defmodule SpellCheck.Naive do

  #@all_words ["these", "are", "some", "words"]

  def exists?(word) do
    all_words = FileReader.read_words_from_file("/tmp/20k.txt")

    word in all_words
  end
end
Enter fullscreen mode Exit fullscreen mode

GenServer:

defmodule SpellCheck.GenServer do
  use GenServer

  #@all_words ["these", "are", "some", "words"]

  def start_link(_opts \\ []) do
    GenServer.start_link(__MODULE__, :ok, name: __MODULE__)
  end

  def exists?(word) do
    GenServer.call(__MODULE__, {:check_exists, word})
  end

  @impl true
  def init(_) do
    all_words = FileReader.read_words_from_file("/tmp/20k.txt")
    {:ok, all_words}
  end


  @impl true
  def handle_call({:check_exists, word}, _from, all_words) do
    {:reply, word in all_words, all_words}
  end
end

SpellCheck.GenServer.start_link()
Enter fullscreen mode Exit fullscreen mode

Test words:

sample_text = """
I soon lost sight and recollection of ghostly fears
<snip>
"""

eval_words = String.split(sample_text, " ", trim: true)
Enter fullscreen mode Exit fullscreen mode

Benchee:

Benchee.run(%{
    "Naive" => fn -> Enum.map(eval_words, fn w -> SpellCheck.Naive.exists?(w) end) end,
    "GenServer" => fn -> Enum.map(eval_words, fn w -> SpellCheck.GenServer.exists?(w) end) end
  },
  time: 5, 
  memory_time: 2
  )
Enter fullscreen mode Exit fullscreen mode

Test 1 - Hard coded dictionary

First, I actually tested the code without reading the 20,000 words from a file and just used an array set as module attributes.

Perhaps not surprisingly, when using a hard coded list of words as a dictionary the extra overhead of a GenServer was noticable and the Naive implementation performed better:

Name                ips        average  deviation         median         99th %
Naive           14.17 K       70.57 μs    ±37.68%       69.79 μs       87.25 μs
GenServer        2.99 K      333.95 μs     ±3.30%      330.88 μs      374.93 μs

Comparison: 
Naive           14.17 K
GenServer        2.99 K - 4.73x slower +263.38 μs

Memory usage statistics:

Name         Memory usage
Naive           171.97 KB
GenServer       246.55 KB - 1.43x memory usage +74.58 KB
Enter fullscreen mode Exit fullscreen mode

Interpreting the results: Naive was 5x faster and used ~60% less memory.

But this is not a viable way to check spelling because we need to load a large set of words.

How will it perform at higher scale?


Test 2 - Reading dictionary from a file

You won't be surprised to see that the GenServer performs much better, considering that their purpose is maintaining state so we don't need to load the state over and over.

However, I was surprised to see HOW MUCH better it was:

Name                ips        average  deviation         median         99th %
GenServer        110.18      0.00908 s    ±55.16%      0.00876 s      0.00981 s
Naive              0.68         1.46 s     ±6.78%         1.41 s         1.61 s

Comparison: 
GenServer        110.18
Naive              0.68 - 160.88x slower +1.45 s

Memory usage statistics:

Name         Memory usage
GenServer      0.00024 GB
Naive             1.98 GB - 8419.39x memory usage +1.98 GB

Enter fullscreen mode Exit fullscreen mode

Interpreting the results: GenServer was 160x faster and a tiny fraction of the memory that Naive used. In fact, it looks like GenServer's memory usage didn't change much between tests.

Though the overall throughput between the two tests varies significantly. My method of checking if a word is in the dictionary could certainly be improved with a data structure that supports a binary search instead of a linear search. But this is good enough for my use case and I'll need to explore faster searching another time.

Top comments (0)