This series is about creating a data structure named inverted index.
It is a central component in probably every search engine out there, and a very nice excercise to apply computer science skills.
This search engine will be written in Go and I am going to use this excercise to point out how I move from zero to a Go package.
The core component is of course the data structure and it's efficient way of searching arbitrary text data.
View the wikipedia article about the inverted index
For sure, this projects it not an enterprise-ready solution, there are probably better and more efficient search engines out there.
However I think this is a very nice example to improve skillset and learn something about this type of index.
Taking all that into consideration, I came up with the following requirements:
- Create an instance of our search engine
- Add arbitrary strings to our search engine cache
- Search through all our strings in the cache
For now, I'll cope with these.
Define the Interface
Looking through my demands on the core data structure and the accessor interface, I'll define the methods that I'd like to use from an external perspective.
// This is pseudo code
// Create a new instance, configurable with a struct
New(config) instance
// Add strings to the search engine cache
Add(string)
// The method for our search execution
Find(string) []string
Now the interface for our external view is defined. We'll work on our datastructures.
We start with the quintessence of our search engine - the inverted index.
Every major search engine makes use of this index type. Basically its a mapping of words to where they are stored in our cache.
So if we store the first verse of Bohemian Rhapsody
in our index, we'd like our algorithm to tell us where the word mama
is.
Mama, just killed a man <-- mama is here
Put a gun against his head, pulled my trigger, now he's dead
Mama, life had just begun <-- mama is also here
But now I've gone and thrown it all away
Mama, ooh, didn't mean to make you cry <-- mama is here too
If I'm not back again this time tomorrow
Carry on, carry on as if nothing really matters
Every line of Bohemian Rhapsody will be an entry in a string slice. We call it our cache
.
For our lookup-table (call it index
) we take a map which will have a string as key and as value we store the position in the cache
slice.
Since there are possibly more occurences of a single string, we will also pick a slice of integers here.
type GlSearch struct {
config Config
cache []string
index map[string][]int
}
This is now our GlSearch
struct where we can implement our methods. The most simple will be the New
method:
func New(c Config) *GlSearch {
glsearch := GlSearch{
config: c,
cache: []string{},
index: map[string][]int{},
}
return &glsearch
}
This method just initializes our struct to avoid nil pointer dereference errors afterwards.
Next method attached to our struct:
// Add appends the provided string to the cache
// and updates the index accordingly
func (g *GlSearch) Add(s string) {
// the zero-based position our next entry in the cache will take
i := len(g.cache)
g.cache = append(g.cache, s)
// transform our string to build up a useful index
// split in words using space characters
words := split(s, g.config.Seperators)
// update the index for every word
for _, v := range words {
// append or create if it does not exist already
if e, ok := g.index[v]; !ok {
e = []int{i}
g.index[v] = e
} else {
e = append(e, i)
g.index[v] = e
}
}
fmt.Printf("Cache: %q \nIndex: %v\n", g.cache, g.index)
}
Here I introduced the first configuration parameters. It's the Seperators
.
Time by time when I notice that I want some parameter for my package to be configurable, I introduce a new field for our Config
struct.
I typically go with this approach because at the beginning I don't exactly know what should be configurable
but is always handy to do so.
The split method
// split turns the input string into a slice of strings
// by seperating where any of the runes in the sep slice match
func split(s string, sep []rune) []string {
return strings.FieldsFunc(s, func(r rune) bool {
for _, v := range sep {
if v == r {
return true
}
}
return false
})
}
If you're new to the Go programming language, you may not be familiar with the
rune
datatype.
Read more about it at yourbasic.org
func main() {
search := glsearch.New(glsearch.DefaultConfig)
search.Add("Mama, just killed a man")
search.Add("Put a gun against his head, pulled my trigger, now he's dead")
// our contents in cache and index right now
// Cache: ["Mama, just killed a man"
// "Put a gun against his head, pulled my trigger, now he's dead"]
// Index: map[Mama:[0] Put:[1] a:[0 1] against:[1]
// dead:[1] gun:[1] he:[1] head:[1] his:[1] just:[0] killed:[0]
// man:[0] my:[1] now:[1] pulled:[1] s:[1] trigger:[1]]
}
Now the index shows the desired representation of our data so far.
By examining the Index data structure precisely, the attendive reader already notices some obstacles and drawbacks in our data.
We'll discuss this in the next part of my series.
Top comments (0)