Łukasz Pluszczewski for Brainhub

Posted on Oct 17, 2023 • Edited on Oct 26, 2023

Make Notion search great again: Notion API

#vectordatabase #semanticsearch #api #productivity

In this series, we’re looking into the implementation of a vector index built from the contents of our company Notion pages. The index allows us not only to search for relevant information but also to let a language model answer our questions directly, with Notion as its knowledge base. In this article, we will explore the Notion API.

Before we can create a searchable index, we need to get the contents of the Notion pages. Let’s see how we used the Notion API to do that.

Notion integrations

Before we can start using the Notion API, we must create an integration, sometimes called a “Connection” in the UI. An “Integration Secret” is generated for each integration and is then used to authenticate API requests. You can select permissions for the integration, which Notion calls “capabilities”. Our index does not write anything to the indexed pages, so we selected only the “read” permissions. We also allowed the integration to read user information so that we can replace mentions with people's real names.

An integration can only access pages that it has been manually added to. This access also extends to all of their descendant pages. Keep in mind that what an integration has access to is not as straightforward as you might think. Removing it from a page may not propagate to a child page if either the permissions or the integration access of that child page have been modified separately. Also, an integration has access to non-public pages! Because our solution gives everyone in the company indirect access to all indexed pages (via the language model's answers), we’ve made sure to assign it only to pages that we’re absolutely sure don’t contain any private information.

Notion client

To communicate with the Notion API, we’ve used a dedicated TypeScript library, @notionhq/client.

Below is an example of the configuration and a simple request:



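// The library exports the client class as `Client`; it is aliased here to match the name used in this article.
import { Client as NotionClient, LogLevel } from '@notionhq/client';

const client = new NotionClient({
  auth: 'abcdef123', // placeholder for the integration secret
  logLevel: LogLevel.WARN,
});

await client.pages.retrieve({ page_id: '3853acec-eebc-42e2-843b-2c340f769b80' });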

Besides retrieving pages, we can also fetch databases, blocks, a block’s children, etc.



await client.databases.retrieve({ database_id: 'c2da9700-8244-4bc0-bff1-8dccd909b211' });
await client.blocks.retrieve({ block_id: 'b00afc3d-1db2-4cf3-9801-868bd84f06f8' });
await client.blocks.children.list({ block_id: 'b00afc3d-1db2-4cf3-9801-868bd84f06f8' });

Rate limiting

The Notion API doesn’t enforce a strict per-second limit. You’re expected not to exceed an average of 3 requests per second, but occasional bursts above that are allowed. That’s why our internal rate limiter allows slightly more frequent requests, with proper rate limit error handling just in case. If you reach the rate limit, the Notion API responds with HTTP 429, the “rate_limited” error code, and a “retry-after” header that indicates the wait time in seconds.

To ensure that we handle the API’s rate limits correctly, we’ve implemented an API client wrapper that handles errors appropriately. Below is a simplified example of rate limit handling:



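async function request() {
  try {
    return await client.databases.retrieve({ database_id: 'c2da9700-8244-4bc0-bff1-8dccd909b211' });
  } catch (error) {
    // Anything other than a rate limit error is rethrown unchanged.
    if (!isNotionApiErrorOfType(error, APIErrorCode.RateLimited)) {
      throw error;
    }

    // The "retry-after" header is expressed in seconds.
    const retryAfter = parseInt(error.headers.get('retry-after'), 10);

    // delay() is a helper (not shown) that takes the wait time in milliseconds.
    return delay(
      () => request(),
      retryAfter * 1000,
    );
  }
}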
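
The internal rate limiter mentioned above is not shown in this article. As a rough illustration only, here is a minimal sketch of one possible approach: a limiter that spaces out calls by a minimum interval. The class name `MinIntervalLimiter`, the injectable clock and sleep functions, and the 300 ms interval in the usage note are all assumptions made for this example, not the implementation we run.

```typescript
// Hypothetical sketch: makes sure calls start at least `minIntervalMs` apart.
// The clock and sleep functions are injectable so the logic can be exercised without real timers.
type Sleep = (ms: number) => Promise<void>;

const defaultSleep: Sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

class MinIntervalLimiter {
  private nextSlot = 0; // earliest timestamp at which the next call may start
  private readonly minIntervalMs: number;
  private readonly now: () => number;
  private readonly sleep: Sleep;

  constructor(
    minIntervalMs: number,
    now: () => number = () => Date.now(),
    sleep: Sleep = defaultSleep,
  ) {
    this.minIntervalMs = minIntervalMs;
    this.now = now;
    this.sleep = sleep;
  }

  async schedule<T>(task: () => Promise<T>): Promise<T> {
    const current = this.now();
    const startAt = Math.max(current, this.nextSlot);
    // Reserve the slot before awaiting so that concurrent callers queue up behind each other.
    this.nextSlot = startAt + this.minIntervalMs;
    const wait = startAt - current;
    if (wait > 0) {
      await this.sleep(wait);
    }
    return task();
  }
}
```

With an interval of 300 ms, `new MinIntervalLimiter(300)` would allow roughly 3.3 requests per second, slightly above the documented average, and each API call would be wrapped as `limiter.schedule(() => request())`. The error handling shown earlier is still needed, because the limiter only reduces the chance of hitting the limit.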

Pagination

Most endpoints limit the number of entries returned and provide a “cursor” if there is more data to fetch. Below is a simple example of how to handle pagination when we want to load all data: a function that fetches all pages.



private async fetchAllPages(query?: string, cursor?: string) {
  // client.search() takes a single object with the query and the cursor.
  const response = await client.search({
    query,
    start_cursor: cursor,
  });

  // A non-null next_cursor means there is another page of results to fetch.
  if (response.next_cursor) {
    return [
      ...response.results,
      ...(await this.fetchAllPages(query, response.next_cursor)),
    ];
  }
  return response.results;
}

Since the Notion API does not have a “get all pages” endpoint, the function above uses the search endpoint with an empty query to retrieve all pages. This is not fully reliable: for example, recently added pages or databases may not have been indexed yet and will not be returned. We’ve decided that it’s good enough for now.
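
The same cursor handling can also be written as a loop that works for any paginated endpoint. The sketch below is illustrative only: `collectPaginated` and `PaginatedResponse` are hypothetical names, and the interface lists just the fields the loop relies on (`results`, `next_cursor`, `has_more`).

```typescript
// Simplified shape of a paginated response (only the fields used here).
interface PaginatedResponse<T> {
  results: T[];
  next_cursor: string | null;
  has_more: boolean;
}

// Hypothetical helper: keeps requesting pages until no further cursor is reported.
async function collectPaginated<T>(
  fetchPage: (cursor?: string) => Promise<PaginatedResponse<T>>,
): Promise<T[]> {
  const all: T[] = [];
  let cursor: string | undefined = undefined;
  do {
    const response: PaginatedResponse<T> = await fetchPage(cursor);
    all.push(...response.results);
    // Continue only when the API says there is more data and provides a cursor.
    cursor = response.has_more && response.next_cursor ? response.next_cursor : undefined;
  } while (cursor !== undefined);
  return all;
}
```

Under these assumptions, fetching all pages would look roughly like `collectPaginated((cursor) => client.search({ query: '', start_cursor: cursor }))`. A loop avoids deep recursion and repeated array copying when a workspace has many pages.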

Blocks

Text in Notion is structured around “blocks”, which are the basic units of content. Whatever you add to a page is a block: a paragraph, a list, a table, and so on. Each block can be standalone, like a paragraph with just some text in it, or have child blocks, like a list item containing a sub-list. Below is an example of a block (from the Notion documentation):



{
    "id": "c02fc1d3-db8b-45c5-a222-27595b15aea7",
    "type": "heading_2",
    "heading_2": {
        "rich_text": [
            {
                "type": "text",
                "text": {
                    "content": "Lacinato kale",
                    "link": null
                },
                "annotations": {
                    "bold": false,
                    "italic": false,
                    "strikethrough": false,
                    "underline": false,
                    "code": false,
                    "color": "green"
                },
                "plain_text": "Lacinato kale",
                "href": null
            }
        ],
        ...
    }
}

There are more properties (like “parent” or “last_edited_time”) that we’ve hidden so that we can focus on what’s important. Each block has a “type” property that tells us what kind of block it is and also where to find its contents: the payload lives under a key named after the type (“heading_2” in the example above). Different blocks have different data structures, so we have a separate piece of code, called a “parser”, to handle each block type. Below are two examples of parsers:



[BlockTypes.NumberedListItem]: {
  parse: (block, ctx) => {
    const number =
      'number' in block.numbered_list_item
        ? block.numbered_list_item.number
        : null;
    const text = getPlainText(block.numbered_list_item.rich_text, ctx);
    return number ? `${number}. ${text}` : text;
  },
},
[BlockTypes.ToDo]: {
  parse: (block, ctx) => {
    const text = getPlainText(block.to_do.rich_text, ctx);
    const checkbox = block.to_do.checked ? '[X]' : `[ ]`;
    return `${checkbox} ${text}`;
  },
},

The “rich_text” property contains an array of elements that we need to parse. The “getPlainText” function is a simple helper that converts this array into a string. Additionally, it receives a “context” containing the list of users so that it can replace all mentions with actual names.

Our parsers return text formatted as Markdown, because it is easily understood by LLMs and, unlike HTML, doesn’t leave much garbage once special characters are removed for embeddings.
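
To make the description above more concrete, here is a minimal sketch of what such a helper could look like. It is not our actual implementation: the function name `getPlainTextSketch`, the simplified `RichTextItem` type, and the context shape (a map from user ID to display name) are assumptions made for this example.

```typescript
// Simplified rich text item; the real Notion type has more variants and fields.
interface RichTextItem {
  type: 'text' | 'mention' | 'equation';
  plain_text: string;
  mention?: { type: string; user?: { id: string } };
}

// Assumed context shape: user ID -> display name.
interface ParserContext {
  users: Map<string, string>;
}

// Hypothetical helper: joins rich text items into one string, resolving user mentions.
function getPlainTextSketch(richText: RichTextItem[], ctx: ParserContext): string {
  return richText
    .map((item) => {
      if (item.type === 'mention' && item.mention?.type === 'user' && item.mention.user) {
        const name = ctx.users.get(item.mention.user.id);
        // Fall back to Notion's own plain_text when the user is unknown.
        if (name) return `@${name}`;
      }
      return item.plain_text;
    })
    .join('');
}
```

Falling back to `plain_text` keeps the output readable even when a mentioned user is missing from the context, for example because the integration cannot read that user's information.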

Since blocks can have child blocks, we fetch blocks recursively:



private async getBlocksRecursively(pageOrBlockId: string): Promise<BlockObjectResponse[]> {
  // For brevity, this example reads only the first page of children (no pagination).
  const blocks = await this.notion.blocks.children.list({
    block_id: pageOrBlockId,
  });

  // Fetch the children of all blocks concurrently and attach them to their parents.
  return await Promise.all(
    blocks.results.map(async (block) => {
      if (block.has_children) {
        return {
          ...block,
          children: await this.getBlocksRecursively(block.id),
        };
      }
      return block;
    }),
  );
}

By gathering all blocks recursively and converting them to text, we get clean, Markdown-formatted content for the page.
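
As an illustration of that last step, the sketch below walks a nested block tree and dispatches each block to a parser by its type. Everything here is simplified and hypothetical: `BlockNode` uses a single `text` field as a stand-in for the type-specific payload, and the three parsers are examples rather than our full set.

```typescript
// Simplified block: `text` stands in for the type-specific payload of a real block.
interface BlockNode {
  type: string;
  text: string;
  children?: BlockNode[];
}

type BlockParser = (block: BlockNode) => string;

// Example parsers keyed by block type (hypothetical subset).
const exampleParsers: Record<string, BlockParser> = {
  paragraph: (b) => b.text,
  heading_2: (b) => `## ${b.text}`,
  bulleted_list_item: (b) => `- ${b.text}`,
};

// Renders blocks depth-first, indenting children by two spaces per nesting level.
function renderBlocks(blocks: BlockNode[], depth = 0): string {
  const indent = '  '.repeat(depth);
  const lines: string[] = [];
  for (const block of blocks) {
    const parse = exampleParsers[block.type];
    // Unknown block types are skipped rather than failing the whole page.
    if (parse) lines.push(indent + parse(block));
    if (block.children) {
      const nested = renderBlocks(block.children, depth + 1);
      if (nested) lines.push(nested);
    }
  }
  return lines.join('\n');
}
```

Indenting children keeps nested lists valid Markdown; other block types might need different handling of their children.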

Pages and Databases

Notion organizes its content into pages. In addition to the block contents described above, each page can also have properties. They are similar to blocks, but they have keys, don't have children, have IDs that are not UUIDs, and lack certain properties like “last_edited_time”. The possible types and formats of property values are the same as those of blocks, so we use the same code to parse them. Below is an example of a property with the key “When” and the type “date”:



"When": {
    "id": "some-id",
    "type": "date",
    "date": {
        "start": "2023-03-23",
        "end": "2023-05-05",
        "time_zone": null
    }
},
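
A property like this one could be flattened to text along the following lines. This is only a sketch: `formatDateProperty` and the `DateValue` interface are hypothetical names, and the "start to end" output format is an arbitrary choice for the example.

```typescript
// Simplified shape of the "date" payload shown in the example above.
interface DateValue {
  start: string;
  end: string | null;
  time_zone: string | null;
}

// Hypothetical formatter: renders a single date or a date range as plain text.
function formatDateProperty(key: string, date: DateValue | null): string {
  if (!date) return `${key}: (empty)`;
  const range = date.end ? `${date.start} to ${date.end}` : date.start;
  return `${key}: ${range}`;
}
```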

Notion also includes databases, which are collections of pages that can be filtered, sorted, and organized as needed. When you view a database as a table, each cell shows the value of the corresponding “property” of the given page. In our index, we represent databases and simple tables in the same way.
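
One way to picture such a shared representation is a Markdown table built from already-parsed property values. The sketch below is an assumption for illustration, not our actual output format: `toMarkdownTable` and the `TableRow` type are hypothetical.

```typescript
// One database row: property name -> already-parsed text value.
type TableRow = Record<string, string>;

// Hypothetical helper: renders columns and rows as a Markdown table.
function toMarkdownTable(columns: string[], rows: TableRow[]): string {
  // Pipes would break the table layout and newlines would break the row, so both are neutralized.
  const escapeCell = (value: string) => value.replace(/\|/g, '\\|').replace(/\n/g, ' ');
  const header = `| ${columns.map(escapeCell).join(' | ')} |`;
  const separator = `| ${columns.map(() => '---').join(' | ')} |`;
  const body = rows.map(
    (row) => `| ${columns.map((column) => escapeCell(row[column] ?? '')).join(' | ')} |`,
  );
  return [header, separator, ...body].join('\n');
}
```

Called with the columns `['Name', 'When']`, each database page contributes one row, and a simple table block could be passed through the same function with its cells as values.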

Pages that are members of databases, in addition to their properties, can also have ordinary content. In other words, with each “row” being a page, each “column” is a property of that page, but because the page itself works just like any other, users can add normal content to it: paragraphs, lists, images, and so on. Because the content is not visible in any database view in Notion and is not visible in our representation of a database, we additionally index all members as separate pages.

Retrieving data from Notion can be unintuitive and sometimes tedious, especially when dealing with permission management and handling different types of blocks. However, we've successfully parsed all page and database contents into clean markdown texts. The only thing left to do is to build a vector index from these contents, but we'll cover that in the next article. Stay tuned!
