This part of my series deals with the obstacles introduces in the first part.
We already have our basic indexing logic set up, but I'll show you the drawbacks that we introduced.
Currently, we have the following data:
Cache: ["Mama, just killed a man"
"Put a gun against his head, pulled my trigger, now he's dead"]
Index: map[Mama:[0] Put:[1] a:[0 1] against:[1]
dead:[1] gun:[1] he:[1] head:[1] his:[1] just:[0] killed:[0]
man:[0] my:[1] now:[1] pulled:[1] s:[1] trigger:[1]]
Some major drawbacks in the data here include:
-
Different character casing - this is bad because
Mama
is different tomama
and therefore is identified as another string and cannot be found -
Small pieces - You see something like
a
,he
andhis
in the index map. No one will ever search for these terms because they are so universal and just mess up our memory. They are also called stopwords, and we should cut them out.
So we have to add some additional code to our Add()
method.
// Add appends the provided string to the cache and updates the index accordingly
func (g *GlSearch) Add(s string) {
i := len(g.cache) // the zero-based position our next entry in the cache will take
g.cache = append(g.cache, s)
// transform our string to build up a useful index
// split in words using space characters
words := split(s, g.config.Seperators)
// remove words that occur very often (stopwords, e.g.: the, now, to, ...)
// to improve index quality and minimize memory needs
filterWords(&words, g.config.KeepFunc, g.config.TransformFunc)
// update the index for every word
for _, v := range words {
// append or create if it does not exist already
if e, ok := g.index[v]; !ok {
e = []int{i}
g.index[v] = e
} else {
e = append(e, i)
g.index[v] = e
}
}
fmt.Printf("Cache: %q \nIndex: %v\n", g.cache, g.index)
}
After splitting up and adding it to our index we'll transform our words
slice further by removing the stopwords and adjusting casing to satisfy our needs.
In this filterWords()
method we can work on our string slice directly to not allocate any additional memory.
But we'll have to use pointers when we want to cut the slice in length.
Pro tip - read more about it in this post
func filterWords(p *[]string, keep FilterFunc, transform TransformFunc) {
// we have to use pointers here
// to make the removal of our entries work
s := *p
n := 0
for _, x := range s {
// keep and transform are defined within the Config struct
if keep(x) {
s[n] = transform(x)
n++
}
}
s = s[:n]
*p = s
}
Now, after including some steps in our data pipeline in the Add()
method, our example data from before looks like this:
Before:
Cache: ["Mama, just killed a man"
"Put a gun against his head, pulled my trigger, now he's dead"]
Index: map[Mama:[0] Put:[1] a:[0 1] against:[1]
dead:[1] gun:[1] he:[1] head:[1] his:[1] just:[0] killed:[0]
man:[0] my:[1] now:[1] pulled:[1] s:[1] trigger:[1]]
After:
Cache: ["Mama, just killed a man"
"Put a gun against his head, pulled my trigger, now he's dead"]
Index: map[against:[1] dead:[1] gun:[1] head:[1]
his:[1] just:[0] killed:[0] mama:[0] man:[0]
now:[1] pulled:[1] put:[1] trigger:[1]]
As you can see, everything is lowercase and the words which are less than three characters long are removed.
We achieved this by defining some additional func
parameters in our Config
struct.
The below shown type
definitions are used in the filterWords()
method.
// FilterFunc checks if a part in a string should be kept or not
type FilterFunc func(part string) bool
// TransformFunc is responsible for transforming input strings into a better and normalized format
type TransformFunc func(part string) string
The FilterFunc
takes a string and checks if it should be kept or not.
It returns true to keep an entry and false to drop it.
The second method, TransformFunc
modifys every word into a normalized format for our index.
In our case it's just setting the character casing to lowercase.
We'll also use this in our search method to ensure that we extract the exact same representation of our words to be searched.
In our default configuration, it is defined like this:
// DefaultConfig is the default configuration for GlSearch
var DefaultConfig = Config{
Seperators: []rune(" ,.!/'"),
KeepFunc: func(part string) bool {
if len(part) < 3 {
return false
}
return true
},
TransformFunc: func(part string) string {
return strings.ToLower(part)
},
}
Now everything is prepared to finally search through our index.
Part three of this series will cover this topic.
Stay tuned
Top comments (0)