Language is a funny thing. For example, we take for granted that Cinderella wore glass slippers. Only in a fairy tale can people walk in glass shoes. But maybe it’s metaphorical – the fragility of being Cinderella. Or maybe it’s simply a confusion of the old French word “vair” (squirrel fur) with the French “verre” (glass).
Language is also hard to pin down. Especially when lost in translation. But no need to despair – sometimes what gets lost can have surprising and moving results.
But we don’t always want to be surprised. Like when we ask direct questions or search for specific items that match our queries. At that point, we aspire to be crystal-clear.
That’s where dictionaries come in. Dictionaries allow us to be clear, by reinforcing the clarity of each word within the context of the larger phrase. We use custom dictionaries to reinforce our natural language processing (NLP). Here’s how.
Many users still type questions like “What is the best search engine?” instead of the shorter “best search engine”. It’s natural for them to type the way they speak. But other people prefer to use shorter, incomplete phrases to return the same results. With advances in search technology, even nonsensical queries, such as “engine best search”, return great results.
Nevertheless, the full phrase is still in fashion – even more so with voice. The success of voice search depends on allowing people to speak naturally. And stop words are key to that. Removing stop words reduces a natural phrase to its bare essence: keywords. By dropping such words as “what”, “is”, and “the” from the above query, and leaving only the keywords “best”, “search”, and “engine”, the search engine can match the query to the underlying data in a more reliable and relevant way.
Granted, all words are important – “What” and “Why” are indeed meaningful distinctions – but if a search algorithm relies on textual matching (as opposed to matching on meaning or semantics), its only job is to compare characters and words. By removing stop words, therefore, you remove the false positives that match on the word “the”.
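To make the idea concrete, here is a minimal sketch of stop-word removal in Python. The tiny `STOP_WORDS` set and the `remove_stop_words` function are hypothetical names made up for this illustration; a real engine would load its list from a dictionary file.

```python
# Hypothetical, tiny stop-word list; a real one would come from a dictionary file.
STOP_WORDS = {"what", "is", "the"}


def remove_stop_words(query, stop_words=STOP_WORDS):
    """Lowercase the query, split on whitespace, and drop stop words."""
    # Strip common trailing/leading punctuation so "engine?" becomes "engine".
    tokens = [token.strip("?!.,;:") for token in query.lower().split()]
    return [token for token in tokens if token and token not in stop_words]


print(remove_stop_words("What is the best search engine?"))
# ['best', 'search', 'engine']
```

Both the long and the short form of the query end up as the same three keywords, which is what a purely textual matcher needs.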
We can say the same about normalization (e.g., removing accents) and plurals. Any search algorithm that focuses on text, and not the meaning of text, should ignore textual variations (like plurals) to enable more relevant and unambiguous word matching.
Lastly, textual matching also needs to separate words into useful parts. A “houseboat” is not simply a house or a boat, but a boat specially made to be used as a house. To help reach that level of precision, a textual search algorithm needs to break out the constituent parts of a word (atoms), by using techniques like segmentation and decompounding.
The goal of segmentation or decompounding is not to understand the meaning of words, but to find out what a complex word can be decomposed into. We’re trying to find the “atoms” of the word. We don’t use it in English because most words are already decompounded; it’s in the language’s DNA. The same goes for French. But take German: “Hundehütte”, meaning “dog kennel”, is composed of “Hund” (dog) and “Hütte” (kennel/house). The space we already have between the two words in English is why we don’t need decompounding. Segmentation is essentially the same thing, but for languages that put no spaces between words at all (e.g., Chinese and Japanese).
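As an illustration only, dictionary-based decompounding can be sketched as a recursive split: find a known atom at the start of the word, optionally skip a linking element, and repeat on the rest. The `ATOMS` set, the `LINKERS` tuple, and the `decompound` function below are assumptions made for this example, not the engine’s actual implementation.

```python
# Hypothetical mini-dictionary of known German atoms (lowercased).
ATOMS = {"hund", "hütte", "haus", "tür"}
# Linking elements German may insert between atoms (a simplified assumption).
LINKERS = ("", "e", "s", "n", "en", "es")


def decompound(word, atoms=ATOMS):
    """Return the list of atoms that make up `word`, or None if no split is found."""
    word = word.lower()
    # Try the longest candidate atom first.
    for end in range(len(word), 0, -1):
        head = word[:end]
        if head not in atoms:
            continue
        rest = word[end:]
        if not rest:
            return [head]
        for linker in LINKERS:
            # Skip the linker only if something is left to decompose after it.
            if rest.startswith(linker) and len(rest) > len(linker):
                tail = decompound(rest[len(linker):], atoms)
                if tail:
                    return [head] + tail
    return None


print(decompound("Hundehütte"))  # ['hund', 'hütte']
print(decompound("Haustür"))     # ['haus', 'tür']
```

Note that the quality of the split depends entirely on the dictionary: a word whose parts are not listed simply yields no decomposition.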
That’s where dictionaries come in.
One approach to natural language processing is to use dictionaries, such as a stop-word dictionary, plurals dictionary, and a compound-word dictionary. For example, you can parse a downloaded list of stop words from Wiktionary, not only in English, but in many other languages.
Here’s the process we used:
- Download the full Wiktionary dictionary – words, definitions, and much more
- Extract the words
- Store them in text files
- Compile them into a binary format
- Optimize code for performance
We do this for every language and it works fairly well for most use cases. But when it doesn’t work, it breaks relevance – which is a critical show-stopper for search engines. Here are some problems we encountered:
“Down” is a reasonable stop word, except when you’re searching for “down jackets”. Companies that sell “leather”, “suede”, and “down” jackets cannot remove “down” from the query.
Languages that use accents, like French and Spanish, fare well when normalized with accent removal. For example, normalizing “voilà” to “voila” causes no loss in meaning. In fact, it’s rare in French that removing an accent would create an ambiguity. German is not so lucky. For example, the accented “ä”, when normalized to “a”, will change the meaning of some words.
A curious example of this is the German word “wählen”, meaning “to choose” in English. If you remove the accent, most people will not object – except for the 1500 residents of the small German-speaking Swiss town, Wahlen. It might be hard to find the town “Wahlen” among the many results that match on “wählen” – thus, hurting tourism in that part of the world.
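The collision is easy to reproduce. The sketch below uses Python’s standard `unicodedata` module to strip accents generically; the `strip_accents` helper is a name made up for this example and is not necessarily how any particular engine normalizes text.

```python
import unicodedata


def strip_accents(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("voilà"))   # voila
print(strip_accents("wählen"))  # wahlen

# After generic accent removal, the verb and the town name are indistinguishable.
print(strip_accents("wählen") == strip_accents("Wahlen").lower())  # True
```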
The solution is to do a special custom normalization for German. In this case, normalize “ä” to “ae”. Here’s a complete list:
- ä → ae
- ö → oe
- ü → ue
- Ä → Ae
- Ö → Oe
- Ü → Ue
- ß → ss (or SZ for capital)
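A minimal sketch of this German-specific normalization, assuming Python and a plain character-mapping table, could look like the following. `GERMAN_MAP` and `normalize_german` are hypothetical names, and the capital-letter variant of “ß” is left out to keep the example short.

```python
import unicodedata

# Mapping taken from the list above; the capital form of "ß" is not handled here.
GERMAN_MAP = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})


def normalize_german(text):
    """Compose characters first (NFC) so each umlaut is a single code point, then map."""
    return unicodedata.normalize("NFC", text).translate(GERMAN_MAP)


print(normalize_german("wählen"))  # waehlen
print(normalize_german("Wahlen"))  # Wahlen
print(normalize_german("für"))     # fuer
```

With this mapping, “wählen” and “Wahlen” no longer normalize to the same string, so the town stays findable.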
But this leads to a second problem, which illustrates the gymnastics search engines go through when dealing with languages. (Remember, language is funny…) Say we normalize “für” to “fuer”. Now we lose the stop word “fur”, because the newly normalized “fuer” is not a stop word.
That’s where custom dictionaries come in.
We realized that one dictionary per category wasn’t enough: we needed an additional dictionary per customer that they could use to override the Wiktionary defaults or add their own words. So now we have two dictionaries per category (stop words, plurals, etc.): one per language, which we ship with our software, and one custom dictionary per customer, which they can add words to.
Adding custom dictionaries required a bit of refactoring in how we dealt with our standard dictionaries: each dictionary-retrieval function had a different interface and each dictionary dataset had a different format. So the first step was to normalize our code and data.
We examined the current dictionaries that we shipped to our customers as part of our base product. We wanted to abstract what every dictionary had in common. Since they all had the same kind of data and goals, we were able to do the following:
- Create similar data: a list of words
- Code the same goal: the ability to retrieve words
To put these dictionaries into a single interface, the main tasks included (in this order):
- Transforming all the dictionary datasets to have the same data structure: the trie
- Migrating existing dictionaries to the new format
In the end, our new format for the plurals dictionary file is:
[2-letter language code]=[word1,word2,...]
In keeping with our introduction, here’s a good example of plurals:
en=feet,feets,foot,foots
en=slipper,slippers
en=squirrel,squirrels
en=fur,furs
en=Cinderella,Cinderellas
en=Cinderfella,Cinderfellas
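To show how simple this format is to consume, here is a small Python sketch that reads such lines into a lookup table. The `parse_plurals` function and the shape of its result are assumptions made for the example; the actual loader is not described here.

```python
from collections import defaultdict


def parse_plurals(lines):
    """Map language -> lowercased word -> tuple of all equivalent forms."""
    groups = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        language, _, words = line.partition("=")
        forms = tuple(w.strip() for w in words.split(",") if w.strip())
        for form in forms:
            groups[language][form.lower()] = forms
    return groups


SAMPLE = ["en=feet,feets,foot,foots", "en=slipper,slippers"]
plurals = parse_plurals(SAMPLE)
print(plurals["en"]["slippers"])  # ('slipper', 'slippers')
print(plurals["en"]["foot"])      # ('feet', 'feets', 'foot', 'foots')
```

With a table like this, a query for “slippers” can be expanded to every form in its group before matching.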
That’s the first part: unifying both the interface and structure of the data.
With that, we achieved the following goals:
- A simpler dictionary interface for all dictionaries
- Shared tooling and tests
- Easier maintenance
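For readers unfamiliar with the trie mentioned above as the common data structure, here is a minimal Python sketch. The `Trie` class is purely illustrative; the production structure is compiled to a binary format and is certainly more compact than nested Python dictionaries.

```python
class Trie:
    """A minimal prefix tree: each node maps a character to a child node."""

    def __init__(self):
        self.children = {}
        self.is_word = False

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True  # mark the end of a complete word

    def __contains__(self, word):
        node = self
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word

    @classmethod
    def from_words(cls, words):
        trie = cls()
        for word in words:
            trie.insert(word)
        return trie


stop_words = Trie.from_words(["the", "then", "down"])
print("down" in stop_words)  # True
print("dow" in stop_words)   # False: a prefix, but not a stored word
```

Because every dictionary is just a list of words with a retrieval operation, the same structure can back stop words, plurals, and compounds alike.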
Now that we had a single interface for every dictionary, we were able to integrate customer-defined words for every NLP technique, for example, customer-specific stop words (see “down” example above), customer-specific normalization (see “für” example above), and so on.
These custom dictionaries are added to the index on top of the static dictionaries. We prioritized the dictionary lookups: a query consults the custom dictionary first, then the static one. If the word is found in the custom dictionary, the engine doesn’t need to look at the static dictionary.
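The lookup priority can be sketched as follows. This is a hypothetical Python model, not the real engine code: the `StopWordDictionary` class and the convention that each entry maps a word to `True` (treat as a stop word) or `False` (explicitly keep it) are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class StopWordDictionary:
    """Static per-language entries, with per-customer entries layered on top."""
    static: dict
    custom: dict = field(default_factory=dict)

    def is_stop_word(self, word):
        word = word.lower()
        # The custom dictionary wins: if the word is there, stop looking.
        if word in self.custom:
            return self.custom[word]
        return self.static.get(word, False)


english = StopWordDictionary(
    static={"the": True, "down": True},
    custom={"down": False},  # a jacket retailer keeps "down" as a keyword
)
print(english.is_stop_word("down"))  # False
print(english.is_stop_word("the"))   # True

german = StopWordDictionary(static={"fur": True}, custom={"fuer": True})
print(german.is_stop_word("fuer"))   # True
```

The same layering covers both problems described earlier: a customer can remove “down” from the stop words, or add “fuer” to them, without touching the shipped dictionary.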
And that’s it: our customers can now slip on one slipper and help their own customers find the other slipper(s) – in fur or glass.