Łukasz Pluszczewski for Brainhub

Make Notion search great again: Vector Database

In this series we’re looking into the implementation of a vector index built from the contents of our company Notion pages, one that allows us not only to search for relevant information but also to let a language model directly answer our questions with Notion as its knowledge base. In this article, we will see how we’ve used a vector database to finally achieve this.

Numbers, vectors, and charts are real data unless stated otherwise

Last time we downloaded and processed data from Notion API. Let’s do something with it.

Vector Database

To find semantically similar texts we need to calculate the distance between vectors. While we have just a few short texts, we can brute-force it: calculate the distance between our query and each text embedding one by one and see which one is the closest. When we deal with thousands or even millions of entries in our database, however, we need a more efficient way of comparing vectors. Just like with any other search over a lot of entries, an index can help here. To make our lives easier we’ll use Weaviate DB, a vector database that implements the HNSW vector index to improve the performance of vector search.
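To make the brute-force baseline concrete, here is a minimal sketch in TypeScript (the function names are ours, purely for illustration; this is exactly the O(n)-per-query scan that an HNSW index lets us avoid):

```typescript
// Cosine distance between two embedding vectors: 1 - (a·b) / (|a| * |b|).
function cosineDistance(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force nearest neighbour: compare the query vector against every
// stored vector and return the index of the closest one.
function bruteForceSearch(query: number[], vectors: number[][]): number {
  let bestIndex = 0;
  let bestDistance = Infinity;
  vectors.forEach((vector, i) => {
    const distance = cosineDistance(query, vector);
    if (distance < bestDistance) {
      bestDistance = distance;
      bestIndex = i;
    }
  });
  return bestIndex;
}
```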

There are a lot of different vector databases you can use. We’ve used Weaviate DB because it has reasonable defaults, including vector and BM25 indexes working out of the box, and a lot of features that can be enabled with modules (like the “rerank” mentioned before). You can also consider the Postgres extension pgvector to take advantage of SQL goodness: relations, joins, subqueries, and so on, where Weaviate may be more limited. Choose wisely!

I may revisit the topic of vector indexes in the future, but in this article I’ll just use a database that implements one. To learn more about HNSW itself look here, and to learn more about configuring the vector index in Weaviate DB look here.

Weaviate DB

Weaviate DB is an open-source, scalable vector database that you can easily use in your own projects. The vector goodness is just one Docker container away, and you can run it like this:

docker run -p 8080:8080 -d semitechnologies/weaviate:latest

Weaviate is modular, and there are a number of modules that add functionality to your database. You can provide the embedding vectors for database entries yourself, but there are modules that calculate them for you, like the text2vec-openai module, which uses the OpenAI API. There are modules that let you easily back up your DB data to S3, add rerank functionality to your searches, and many more. Enabling a module is as simple as adding an environment variable:

docker run -p 8080:8080 -d \
  -e ENABLE_MODULES=text2vec-openai,backup-s3,reranker-cohere \
  semitechnologies/weaviate:latest

Now, to connect to the database from our TypeScript project:

import weaviate from 'weaviate-ts-client';

const client = weaviate.client({
  scheme: 'http',
  host: 'localhost:8080',
});

All the data in Weaviate DB is stored in classes (equivalent to tables in SQL or collections in MongoDB), which contain data objects. Objects have one or more properties of various types, and each object can be represented by exactly one vector. Just like SQL databases, Weaviate is schema-based. We define a class with its name, properties, and additional configuration, like which modules should be used for vectorization. Here is the simplest class, with one property:

{
  class: 'MagicIndex',
  properties: [
    {
      name: 'content',
      dataType: ['text'],
    },
  ],
}

We can add as many properties as we like. There are a number of types available: integer, float, text, boolean, geoCoordinates (with special ways to query by location), blob, and lists of most of these, like int[] or text[]:

{
  class: 'MagicIndex',
  properties: [
    { name: 'content', dataType: ['text'] },
    { name: 'tags', dataType: ['text[]'] },
    { name: 'lastUpdated', dataType: ['date'] },
    { name: 'file', dataType: ['blob'] },
    { name: 'location', dataType: ['geoCoordinates'] },
  ],
}

You can also control how, and from which properties, the embeddings are calculated, if you don’t want to provide them yourself:

{
  class: 'MagicIndex',
  properties: [
    { name: 'content', dataType: ['text'] },
    {
      name: 'metadata',
      dataType: ['text'],
      moduleConfig: {
        'text2vec-openai': {
          skip: true,
        },
      },
    },
  ],
  vectorizer: 'text2vec-openai',
}

In this case, we’re going to use the text2vec-openai module to calculate vectors but only from the content property.

Weaviate stores exactly one vector per object, so if you have multiple vectorized fields (or class-name vectorization enabled), the embedding is calculated from the concatenated texts. If you want separate vectors for different properties of the document (like different chunks, the title, metadata, etc.), you need separate entries in the database.

Applying a schema is as simple as:

await client.schema
  .classCreator()
  .withClass(classDefinition)
  .do();

Let’s see what the data objects look like in our Notion index:

{
  pageTitle: 'Locomotive Kinematics of Quick Brown Foxes: An In-Depth Analysis of Canine Velocity Over Lazy Canid Obstacles',
  chunk: '1',
  originalContent: '# Abstract\n\nThe paradigm of quick brown foxes leaping over lazy dogs has long fascinated both the scientific community and the general public...',
  content: 'abstract\nthe paradigm of quick brown foxes leaping over lazy dogs has long fascinated both the scientific community and the general public...',
  pageId: 'dfda9d5d-b059-4186-95f4-7cb8cdf42545',
  pageType: 'page',
  pageUrl: 'https://www.notion.so/LeapFoxSolutions/dfda9d5d-b059-4186-95f4-7cb8cdf42545',
  lastUpdated: '2023-04-12T23:20:50.52Z'
}

Let’s get the obvious out of the way: we store the page title, its ID, URL, and the last update date. We also vectorize only the content property: the vectorizer ignores the title, originalContent, and so on.

You probably noticed the chunk property, though. What is it? For vectors to work best, it is preferable that texts are not too long. They are generally used for texts no longer than a short paragraph, so we split the contents of Notion pages into smaller chunks. We’ve used LangChain’s recursive text splitter. It tries to split the text first by double newlines; if some chunks are still too long, by single newlines; then by spaces, and so on. This way we keep paragraphs together when possible. We’ve set the target chunk length to 1000 characters with a 200-character overlap.

The length of the chunks and the way you split them can have a huge impact on vector search performance. It is generally assumed that chunk size should be similar to the length of the query (so during the search you compare vectors of similarly sized texts). In our case, 1000-character chunks, although pretty big, seem to work best, but your mileage may vary. Additionally, we make sure that table rows are not sliced in half, to avoid “orphaned” columns. This is a huge topic and I may revisit it in one of the future posts.
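To make the splitting strategy concrete, here is a simplified sketch of what a recursive splitter does. This toy version omits the 200-character overlap and other details that LangChain’s real RecursiveCharacterTextSplitter handles:

```typescript
// Split text recursively: try coarse separators first ("\n\n"), falling back
// to finer ones ("\n", " ") for pieces that are still over the size limit.
function recursiveSplit(
  text: string,
  chunkSize: number,
  separators: string[] = ['\n\n', '\n', ' ']
): string[] {
  if (text.length <= chunkSize) return [text];
  if (separators.length === 0) {
    // No separators left: hard-cut the text at chunkSize boundaries.
    const cuts: string[] = [];
    for (let i = 0; i < text.length; i += chunkSize) {
      cuts.push(text.slice(i, i + chunkSize));
    }
    return cuts;
  }
  const [sep, ...rest] = separators;
  const pieces = text.split(sep);
  const chunks: string[] = [];
  let current = '';
  for (const piece of pieces) {
    if (piece.length > chunkSize) {
      // Piece is too long on its own: recurse with finer separators.
      if (current) { chunks.push(current); current = ''; }
      chunks.push(...recursiveSplit(piece, chunkSize, rest));
    } else if ((current + sep + piece).length > chunkSize) {
      // Adding this piece would overflow the chunk: start a new one.
      if (current) chunks.push(current);
      current = piece;
    } else {
      // Keep accumulating pieces (e.g. paragraphs) into the current chunk.
      current = current ? current + sep + piece : piece;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

This keeps paragraphs together whenever they fit, which is the property we care about for embeddings.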

We save each chunk separately in the database, and the chunk property is the index of the chunk. Why is it a string and not a number, though? Because we don’t vectorize the title property, we save a separate entry for it that looks like this:

{
  pageTitle: 'Locomotive Kinematics of Quick Brown Foxes: An In-Depth Analysis of Canine Velocity Over Lazy Canid Obstacles',
  chunk: 'title',
  originalContent: 'Locomotive Kinematics of Quick Brown Foxes An In-Depth Analysis of Canine Velocity Over Lazy Canid Obstacles',
  ...
}

In the future, we may decide that we want to vectorize more properties of the page than just content and title. We can do that easily just by adding a new possible value to the chunk property.
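Putting this scheme together, turning one page into database entries conceptually looks like this (a simplified sketch showing only the relevant properties; the helper name is ours):

```typescript
interface IndexEntry {
  pageTitle: string;
  chunk: string;   // '1', '2', ... for content chunks, or 'title'
  content: string; // the text that gets vectorized
}

// One Weaviate object carries exactly one vector, so each chunk of a page
// becomes its own entry, and the title gets a dedicated entry as well.
function pageToEntries(pageTitle: string, chunks: string[]): IndexEntry[] {
  const entries: IndexEntry[] = chunks.map((content, i) => ({
    pageTitle,
    chunk: String(i + 1),
    content,
  }));
  entries.push({ pageTitle, chunk: 'title', content: pageTitle });
  return entries;
}
```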

What’s the deal with the content and originalContent properties? To spare the vectorizer some noise in the data, we prepare a cleaned-up version of each chunk. We remove all special characters, replace multiple whitespace characters with a single one, and change the text to lowercase. In our testing, vector search is slightly more accurate with this simple cleanup. We still keep originalContent, though, because this is what we pass to rerank and use for traditional, reverse-index search.
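A sketch of such a cleanup step might look like this (the exact regexes are an assumption, not our production code):

```typescript
// Produce the cleaned-up "content" from "originalContent": strip special
// characters, collapse repeated whitespace, and lowercase the text.
function cleanForEmbedding(text: string): string {
  return text
    .replace(/[^\p{L}\p{N}\s]/gu, '') // keep only letters, digits, and whitespace
    .replace(/[ \t]+/g, ' ')          // collapse runs of spaces and tabs
    .replace(/\n{2,}/g, '\n')         // collapse blank lines
    .trim()
    .toLowerCase();
}
```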

Lastly, we have the pageType property, which is just a result of a Notion quirk: a page in Notion can be either a page or a database. As mentioned in the previous article, we treat both the same way in our index: databases are converted to simple tables.

OK, we have an idea of what data we are going to store in the database, but how do we add, fetch, and query that data?

Weaviate interface

Weaviate offers two interfaces: a RESTful API and a GraphQL API, and both are reflected in the available TypeScript client methods. We will focus on the GraphQL interface. To get entries from the database, we simply provide a class name and the fields we want to retrieve:

client.graphql
  .get()
  .withClassName('MagicIndex')
  .withFields('pageTitle originalContent pageUrl');

It is recommended that each query is limited, using cursor-based pagination if necessary:

client.graphql
  .get()
  .withClassName('MagicIndex')
  .withFields('pageTitle originalContent pageUrl')
  .withLimit(50)
  .withAfter(cursor);

Let’s add some entries to the database:

await client.data
  .creator()
  .withClassName('MagicIndex')
  .withProperties({
    pageTitle: 'Vulpine Agility vs. Canine Apathy: A Comparative Study',
    chunk: '2',
    originalContent: '## Background \n\n Though colloquially immortalized in typographical tests, the scenario of a quick brown fox vaulting over a lazy dog presents...',
    content: 'background\nthough colloquially immortalized in typographical tests the scenario of a quick brown fox vaulting over a lazy dog presents...',
    pageId: '1ba0b851-d443-4290-8415-3cd295850d14',
    pageType: 'page',
    pageUrl: 'https://www.notion.so/LeapFoxSolutions/1ba0b851-d443-4290-8415-3cd295850d14',
    lastUpdated: '2023-03-01T12:21:30.12Z'
  })
  .do();

With the vectorizer enabled for the MagicIndex class, that’s all we need to do. The entry is added to the database together with its vector representation, calculated by OpenAI’s ADA embedding model. Now we can search for texts about foxes and dogs all day long.

Traditional search

Weaviate allows us to search with traditional reverse-index methods too! We have a bag-of-words ranking function called BM25F at our disposal, configured with reasonable defaults out of the box. Let’s see it in action:

await client.graphql
  .get()
  .withClassName('MagicIndex')
  .withBm25({
    query: 'Can the fox really jump over the dog?',
    properties: ['originalContent'],
  })
  .withLimit(5)
  .withFields('pageTitle originalContent pageUrl _additional { score }')
  .do();

Note the _additional property that we can request in the query. It can contain various extra data related to the object itself (like its ID) or to the search (like the BM25 score, or the cosine distance in the case of vector search).

Vector search

Of course, a reverse-index search will not find the many texts that talk about brown foxes without using those exact words. Thankfully, semantic search is just as easy to perform:

await client.graphql
  .get()
  .withClassName('MagicIndex')
  .withNearText({ concepts: ['Can the fox really jump over the dog?'] })
  .withLimit(5)
  .withFields('pageTitle originalContent pageUrl _additional { distance }')
  .do();

There is some additional magic we can use to make the search even better, like setting the maximum cosine distance we accept in the search results, or using the autocut feature:

await client.graphql
  .get()
  .withClassName('MagicIndex')
  .withNearText({
    concepts: ['Can the fox really jump over the dog?'],
    distance: 0.25,
  })
  .withAutocut(2)
  .withLimit(10)
  .withFields('pageTitle originalContent pageUrl _additional { distance }')
  .do();

Now, not only do we get only results with a cosine distance of less than 0.25 (that’s what the distance setting in the withNearText method does), but additionally, Weaviate’s autocut feature will group the results by similar distance and return the first two groups (more on how autocut works here).
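As a mental model, autocut can be thought of roughly like this (a conceptual sketch with a made-up jump threshold; Weaviate’s actual algorithm differs in its details):

```typescript
// Walk the distance-sorted results and cut after `groups` clusters, starting
// a new cluster whenever the distance to the next result jumps noticeably.
function autocut(distances: number[], groups: number, jump = 0.05): number[] {
  if (distances.length === 0) return [];
  const kept: number[] = [distances[0]];
  let groupCount = 1;
  for (let i = 1; i < distances.length; i++) {
    if (distances[i] - distances[i - 1] > jump) {
      groupCount++;
      if (groupCount > groups) break; // past the allowed number of groups
    }
    kept.push(distances[i]);
  }
  return kept;
}
```

With groups = 2, a tight cluster of close matches and one slightly worse cluster survive, while clear outliers are dropped.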

But that’s not all. We can also steer the search toward some concepts and away from others:

await client.graphql
  .get()
  .withClassName('MagicIndex')
  .withNearText({
    concepts: ['Can the fox really jump over the dog?'],
    moveAwayFrom: {
      concepts: ['typography'],
      force: 0.45,
    },
    moveTo: {
      concepts: ['scientific'],
      force: 0.85,
    },
  })
  .withFields('pageTitle originalContent pageUrl')
  .do();

While the example with foxes is a little silly, you can imagine many scenarios where this feature can be really useful. Maybe you’re looking for “ways to fly” but want to move away from “planes” and toward “animals”. Or you may search for a query but keep the results similar to some other object in the database:

await client.graphql
  .get()
  .withClassName('MagicIndex')
  .withNearText({
    concepts: ['Can the fox really jump over the dog?'],
    moveTo: {
      objects: [{ id: '84ab0371-a73b-4774-8b03-eccb97b640ae' }],
      force: 0.85,
    },
  })
  .withFields('pageTitle originalContent pageUrl')
  .do();

There are many other features that you may want to experiment with. Read more on those in the Weaviate documentation.

Hybrid search

Finally, we can combine the power of vector search with the BM25 index! Here comes hybrid search, which uses both methods and combines their results with given weights:

await client.graphql
  .get()
  .withClassName('MagicIndex')
  .withHybrid({
    query: 'Can the fox really jump over the dog?',
  })
  .withLimit(5)
  .withFields('pageTitle originalContent pageUrl _additional { distance score explainScore }')
  .do();

In the _additional.explainScore property, you will find details about the score contributions from the vector and reverse-index searches. By default, the vector search result has a weight of 0.75 and the reverse-index result a weight of 0.25, and those are the values we use in our Notion search. More about how hybrid search works and how to customize the query (including how to change the way vector and reverse-index results are combined) can be found here.
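Conceptually, the weighting is a blend of normalized scores (a simplification; Weaviate’s actual fusion algorithms normalize and combine whole ranked result lists, which is more involved):

```typescript
// Blend a normalized vector-search score with a normalized BM25 score.
// alpha = 1 means pure vector search, alpha = 0 means pure keyword search;
// 0.75 matches the default weighting mentioned above.
function fuseScores(vectorScore: number, bm25Score: number, alpha = 0.75): number {
  return alpha * vectorScore + (1 - alpha) * bm25Score;
}
```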

Rerank

If we enable the rerank module, we can use it to improve the quality of search results. It works with any search method: vector, BM25, or hybrid:

await client.graphql
  .get()
  .withClassName('MagicIndex')
  .withHybrid({
    query: 'Can the fox really jump over the dog?',
  })
  .withLimit(100)
  .withFields('pageTitle originalContent pageUrl _additional { rerank(property: "originalContent" query: "Can the fox really jump over the dog?") { score } }')
  .do();

Adding a rerank score field to the query makes Weaviate call the rerank module and reorder the results based on the scores received. To increase the chance of finding relevant results, we’ve also increased the limit: now rerank has more texts to work on and can find relevant results even if the hybrid search produced a lot of false positives.

Summary

To summarize: in our Notion index, we’ve used Weaviate DB with the following modules:

  • text2vec-openai, enabling Weaviate to calculate embeddings using the OpenAI API and the ADA model
  • reranker-cohere, allowing us to use Cohere’s reranking model to improve search results
  • backup-s3, just to make it easier to back up data and migrate between environments

To get the data to index, we fetch all Notion pages using the search endpoint with an empty query. For each page, we recursively fetch all blocks, which are then parsed by a set of parsers, one specific to each type of block. The result is a markdown-formatted string for each page.

We then split the contents of each page into chunks: 1000 characters long, with 200 characters of overlap. We also “clean up” the texts by removing special characters and multiple whitespaces to improve the performance of vector search.

The data for each page chunk is then inserted into the database with a fairly straightforward schema. We have the index of the chunk and some properties of the Notion page: URL, ID, title, and type. Additionally, we keep both the original, unaltered content and the cleaned-up version, but we calculate embeddings only from the latter.

To find information in the index, we use hybrid search with a default limit of 100 chunks and with rerank enabled by default.

What worked and what didn’t

So, the $100M question: does it work?

Absolutely! We have a working semantic search that allows us to reliably search for information even without using the exact wording used on the pages we’re looking for. You can search for “parking around the office” or “where to leave my car around the office” or even just “parking?”. How to use a coffee machine? What benefits are available in Brainhub? Which member of the team is skilled in martial arts? Who should I talk to if I want a new laptop? What are Brainhub’s values?

Not everything works perfectly, though. Finding information in large tables (e.g. our table of team members: long, with a lot of columns and long texts inside) can be challenging unless you’re smart about chunking them, e.g. by keeping each row in a single chunk, even a very long one, to avoid orphaned columns. Even then the search is not perfect: when asking who the UX designer on our team is, it may find a chunk containing only one of the three UX designers in the table. While this is fine for search (the results still link to the correct page containing the whole table), it may not be enough for a Q&A bot, which may miss some information because of it.

Another issue is noise. One of the reasons we wanted a better search was the thousands of pages of meeting notes, outdated guidelines, and other mostly irrelevant stuff lurking in the depths of our Notion workspace. We did implement some mitigations to improve search results and get rid of noise, like lowering the “search score” of old pages, but it was not enough. The best method was still manually excluding the most problematic areas. That’s not ideal, of course; we would like our search engine to figure out what’s relevant automatically, so that’s something to research further.

In general, though, the results are more than satisfactory and, while a lot of small tweaks were needed here and there, we’ve managed to create a Notion search that actually works.
