
Łukasz Pluszczewski for Brainhub


Make Notion search great again: semantic search

In this series, we’re looking into the implementation of a vector index built from the contents of our company Notion pages. It allows us not only to search for relevant information but also to let a language model answer our questions directly, with Notion as its knowledge base. In this article, we will see how to use vector embeddings for search and how to improve the results.

Numbers, vectors, and charts are real data unless stated otherwise

Last time we explored vector embeddings and their main utility for our case: distances between them represent semantic similarities between texts. Let’s see them in action.

Airbus or Boeing?

Let’s consider the following texts:

• "Pope John Paul II was the first non-Italian pope in more than..."
• "Pope Francis is the head of the Catholic Church, the bishop..."
• "Nicolaus Copernicus, a Renaissance-era astronomer..."
• "Johannes Kepler was a German astronomer, mathematician, astrologer,..."
• "The Tesla Model 3 is an electric car produced by..."
• "The Ford Focus is a compact car manufactured by Ford..."
• "The Ford Mustang is a series of American automobile..."
• "The Dodge Challenger is the name of three different..."
• "The Boeing 737 is a narrow-body aircraft produced ..."
• "The Airbus A380 is a large wide-body airliner that..."
• "The Airbus A320 family consists of short to..."
• "Salamanders are a group of amphibians typically..."
• "The dog is a domesticated descendant of the wolf..."
• "The cat is a domestic species of small carnivorous mammal..."
• "Elephants are the largest living land animals..."
• "The tiger (Panthera tigris) is the largest living..."
• "Rabbits, also known as bunnies or bunny rabbits..."

We have texts about cars, planes, animals, two popes, and two astronomers. We can calculate embeddings for each text and see how far they are from each other. Using OpenAI’s ada-002 model, we would get 1536-dimensional vectors, so we would have a hard time visualizing them. But we have a tool up our sleeve that will help us out. What is that? That’s right, embeddings 🙂

There is nothing stopping us from calculating 2-dimensional embeddings of those vectors so that we can see the relationships between them on a flat screen. This time, the small embeddings were calculated algorithmically, not by a neural network. Below is the result:

If you’re curious about how to reduce the dimensionality of vectors using the same algorithm, you can read more here
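To give a feel for what “calculated algorithmically” means here: visualizations like this are typically produced with t-SNE or UMAP, but the core idea of projecting high-dimensional vectors down to 2-D can be sketched with plain PCA. The code below is a minimal pure-Python illustration (all names are mine, not from the article; a real pipeline would use a library implementation):

```python
import math
import random

def pca_2d(vectors):
    """Project high-dimensional vectors onto their two principal
    components - a crude stand-in for t-SNE/UMAP, just enough to show
    how big embedding vectors can be squeezed onto a flat chart."""
    n, d = len(vectors), len(vectors[0])
    # Center the data around the mean of each dimension
    mean = [sum(v[j] for v in vectors) / n for j in range(d)]
    rows = [[v[j] - mean[j] for j in range(d)] for v in vectors]

    def matvec(w):
        # Apply the covariance matrix (X^T X) without materializing it
        xw = [sum(r[j] * w[j] for j in range(d)) for r in rows]
        return [sum(rows[i][j] * xw[i] for i in range(n)) for j in range(d)]

    def top_component(orthogonal_to=None, iters=200):
        rng = random.Random(0)
        w = [rng.random() for _ in range(d)]
        for _ in range(iters):
            if orthogonal_to is not None:
                # Deflate: remove the first component to converge on the second
                proj = sum(a * b for a, b in zip(w, orthogonal_to))
                w = [a - proj * b for a, b in zip(w, orthogonal_to)]
            w = matvec(w)
            norm = math.sqrt(sum(a * a for a in w)) or 1.0
            w = [a / norm for a in w]
        return w

    pc1 = top_component()
    pc2 = top_component(orthogonal_to=pc1)
    project = lambda r, pc: sum(a * b for a, b in zip(r, pc))
    return [(project(r, pc1), project(r, pc2)) for r in rows]
```

Feeding the text embeddings through a function like this yields one (x, y) point per text, which is what the chart above plots.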

You can find a few interesting things here. Firstly, the different categories of texts are clearly separated: animals, cars, people, and planes each have their own place in the chart, quite far from the other categories. But that’s not all. The two popes are close together and a little further away from the astronomers. A dog is close to a cat but quite far from a tiger, which in turn is closer to a cat than to a dog or a bunny. Of course, because we decreased the dimensionality of the vectors, we’ve most likely lost a lot of the semantic data encoded in the 1536 values of the original vectors. We can still see the relationships, though.

While embeddings can be used as inputs to neural networks, we don’t need neural networks to use the spatial relationships between them to implement efficient semantic search. It’s enough to calculate the embedding of a search query using the same method and find its closest neighbors in the semantic space. Let’s write some queries about a few topics, calculate embeddings for those queries, and add them to our chart:
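That nearest-neighbor lookup is simple enough to sketch directly. Below is a toy version with hand-made 3-D “embeddings” standing in for real model output (in practice, the query and the documents would go through the same embedding model, and a vector database would do the lookup):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity - a common closeness measure for embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_embedding, index, top_k=3):
    """index is a list of (text, embedding) pairs; in a real system the
    embeddings would come from the same model used for the documents."""
    ranked = sorted(index,
                    key=lambda item: cosine_similarity(query_embedding, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:top_k]]

# Toy index: made-up 3-D vectors in place of 1536-dimensional embeddings
index = [
    ("The Tesla Model 3 is an electric car...", [0.9, 0.1, 0.0]),
    ("The Boeing 737 is a narrow-body aircraft...", [0.1, 0.9, 0.0]),
    ("The cat is a domestic species...", [0.0, 0.1, 0.9]),
]
```

A query embedding that leans toward the first axis, such as `search([0.8, 0.2, 0.0], index, top_k=2)`, returns the car text first and the plane second.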

The last one is not so clear. While we see that the vector for the query is closer to the cars than to the people, in a search engine, the Boeing 737 would still rank quite high in search results for cars to buy.

Looking at the examples above, keep in mind that we don’t see the actual vectors - just their 2-D embeddings. Regardless, you can probably see the utility of vector spaces: you can find texts on similar topics fairly easily. While not perfect, this method is a great first step toward more complex semantic search solutions. Let’s dive deeper to see what we can do with the results to make it better.

Vectors are stupid, language models are not

Well, maybe they are. But they will help us nonetheless. Let’s imagine that for the query “What car should I buy?” we’ve got the following results (not the actual vector search result):

1. “The Boeing 737 is a narrow-body aircraft...”
2. “The tiger (Panthera tigris) is the largest living...”
3. “The Ford Mustang is a series of American...”
4. “The Tesla Model 3 is an electric car produced by...”
5. “The Airbus A380 is a large wide-body airliner...”
6. “The Dodge Challenger is the name of three...”
7. “The Ford Focus is a compact car manufactured...”

The Boeing 737 is probably not your first choice. For such a simple query, the actual results are of course much more accurate - vector search is not that stupid - but irrelevant results may appear for more complex and nuanced queries. A capable language model would clearly distinguish between a plane and a car, or between a text that is roughly on topic and one that actually contains the answer, even a nuanced one. Here comes the rerank!

Rerank

While it’s not feasible to use a big and expensive language model to analyze the thousands or even millions of texts you may have in your database, you can easily afford to let it clean up and reorder your initial vector search results. That’s exactly what rerank models do. They are language models, so, unlike simple vector search, they understand the contents of the texts they process. They accept a query and a list of text documents, and they “rerank” those documents, scoring each by how relevant it is to the query. Running these models is much more expensive than just calculating embeddings, so we only use them after the initial vector search. Let’s use Cohere’s rerank model on our not-so-perfect car-buying search (rerank’s score is shown in brackets; while the initial order from the vector search was made up, the scores from the rerank model are real):

1. [0.41] “The Tesla Model 3 is an electric car...”
2. [0.36] “The Dodge Challenger is the name...”
3. [0.34] “The Ford Focus is a compact car...”
4. [0.32] “The Ford Mustang is a series of...”
5. [0.20] “The tiger (Panthera tigris) is the...”
6. [0.08] “The Boeing 737 is a narrow-body...”
7. [0.05] “The Airbus A380 is a large...”

Now we’re talking! We have relevant results at the top thanks to rerank’s ability to actually understand the query and the texts. While it’s still not perfect (tiger seems dangerously close to the Ford Mustang for some reason), it’s enough in the vast majority of cases. Now let’s put all of that into practice and build a proper search engine!
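The whole two-stage setup - cheap vector search over everything, expensive rerank over the shortlist - can be sketched as follows. The scoring function here is a naive keyword-overlap placeholder I made up for illustration; in a real pipeline that slot would be filled by a call to a hosted rerank model such as Cohere’s:

```python
def rerank(query, candidates, score_fn):
    """Reorder vector-search candidates by a relevance score.
    score_fn(query, doc) stands in for a rerank model; a real
    pipeline would call a hosted model here instead."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored

def keyword_overlap(query, doc):
    # Naive placeholder scorer: count shared lowercase words.
    return len(set(query.lower().split()) & set(doc.lower().split()))

candidates = [  # pretend these came back from the vector search
    "The Boeing 737 is a narrow-body aircraft produced...",
    "The Tesla Model 3 is an electric car produced by...",
    "The Ford Focus is a compact car manufactured by Ford...",
]
```

With `rerank("What car should I buy", candidates, keyword_overlap)`, the two car documents move above the Boeing - the same reshuffling the real rerank model performed above, just with a far cruder notion of relevance.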

In the next articles, we’ll see how to get the data from Notion using its API and how to use the Weaviate vector database to build a searchable index out of it.