A search engine - Part 2: Data pipelines

#go #datastructures #algorithms #computerscience

This part of my series deals with the obstacles introduces in the first part.
We already have our basic indexing logic set up, but I'll show you the drawbacks that we introduced.

Currently, we have the following data:

Cache: ["Mama, just killed a man"
       "Put a gun against his head, pulled my trigger, now he's dead"] 
Index: map[Mama:[0] Put:[1] a:[0 1] against:[1] 
       dead:[1] gun:[1] he:[1] head:[1] his:[1] just:[0] killed:[0] 
       man:[0] my:[1] now:[1] pulled:[1] s:[1] trigger:[1]]

Some major drawbacks in the data here include:

Different character casing - this is bad because Mama is different to mama and therefore is identified as another string and cannot be found
Small pieces - You see something like a, he and his in the index map. No one will ever search for these terms because they are so universal and just mess up our memory. They are also called stopwords, and we should cut them out.

So we have to add some additional code to our Add() method.

// Add appends the provided string to the cache and updates the index accordingly
func (g *GlSearch) Add(s string) {
    i := len(g.cache) // the zero-based position our next entry in the cache will take

    g.cache = append(g.cache, s)
    // transform our string to build up a useful index
    // split in words using space characters
    words := split(s, g.config.Seperators)

    // remove words that occur very often (stopwords, e.g.: the, now, to, ...)
    // to improve index quality and minimize memory needs
    filterWords(&words, g.config.KeepFunc, g.config.TransformFunc)

    // update the index for every word
    for _, v := range words {

        // append or create if it does not exist already
        if e, ok := g.index[v]; !ok {
            e = []int{i}
            g.index[v] = e
        } else {
            e = append(e, i)
            g.index[v] = e
        }
    }

    fmt.Printf("Cache: %q \nIndex: %v\n", g.cache, g.index)
}

After splitting up and adding it to our index we'll transform our words slice further by removing the stopwords and adjusting casing to satisfy our needs.
In this filterWords() method we can work on our string slice directly to not allocate any additional memory.
But we'll have to use pointers when we want to cut the slice in length.

Pro tip - read more about it in this post

func filterWords(p *[]string, keep FilterFunc, transform TransformFunc) {

    // we have to use pointers here
    // to make the removal of our entries work
    s := *p

    n := 0
    for _, x := range s {
        // keep and transform are defined within the Config struct
        if keep(x) {
            s[n] = transform(x)
            n++
        }
    }
    s = s[:n]

    *p = s
}

Now, after including some steps in our data pipeline in the Add() method, our example data from before looks like this:

Before:
Cache: ["Mama, just killed a man"
       "Put a gun against his head, pulled my trigger, now he's dead"] 
Index: map[Mama:[0] Put:[1] a:[0 1] against:[1]
       dead:[1] gun:[1] he:[1] head:[1] his:[1] just:[0] killed:[0]
       man:[0] my:[1] now:[1] pulled:[1] s:[1] trigger:[1]]

After:
Cache: ["Mama, just killed a man"
       "Put a gun against his head, pulled my trigger, now he's dead"] 
Index: map[against:[1] dead:[1] gun:[1] head:[1]
       his:[1] just:[0] killed:[0] mama:[0] man:[0]
       now:[1] pulled:[1] put:[1] trigger:[1]]

As you can see, everything is lowercase and the words which are less than three characters long are removed.

We achieved this by defining some additional func parameters in our Config struct.

The below shown type definitions are used in the filterWords() method.

// FilterFunc checks if a part in a string should be kept or not
type FilterFunc func(part string) bool

// TransformFunc is responsible for transforming input strings into a better and normalized format
type TransformFunc func(part string) string

The FilterFunc takes a string and checks if it should be kept or not.
It returns true to keep an entry and false to drop it.

The second method, TransformFunc modifys every word into a normalized format for our index.
In our case it's just setting the character casing to lowercase.
We'll also use this in our search method to ensure that we extract the exact same representation of our words to be searched.

In our default configuration, it is defined like this:

// DefaultConfig is the default configuration for GlSearch
var DefaultConfig = Config{
  Seperators: []rune(" ,.!/'"),
  KeepFunc: func(part string) bool {
    if len(part) < 3 {
      return false
    }

    return true
  },
  TransformFunc: func(part string) string {
    return strings.ToLower(part)
  },
}

Now everything is prepared to finally search through our index.
Part three of this series will cover this topic.

Stay tuned

DEV Community

A search engine - Part 2: Data pipelines

Top comments (0)

Read next

Must Know Algorithms

2337. Move Pieces to Obtain a String

Python Django Financial Trading Dashboard — Integrating AnyChart JS Charts

FETCHING ADVENT OF CODE INPUT DYNAMICALLY IN GO