Dima

Posted on Apr 29, 2023 • Edited on May 2, 2023

Writing a Search Engine from Scratch using FastAPI and Tantivy

#python #fastapi #tantivy #rust

The importance of search engines in our daily lives cannot be overstated. They help us navigate the vast ocean of information available on the internet and make it accessible at our fingertips. In this article, I’ll guide you through the process of building a custom search engine from scratch using FastAPI, a high-performance web framework for Python, and Tantivy, a fast, full-text search engine library written in Rust.

Before diving into the code, we need to set up our development environment. First, ensure that you have Python installed on your system. FastAPI requires Python 3.6 or higher. Next, install FastAPI and its dependencies. You can do this using the following command:



pip install fastapi[all]

This command will install FastAPI and all the optional dependencies needed for its various features. Tantivy is a Rust library, so we’ll need to use a Python wrapper called “tantivy-py” to work with it. Install the wrapper using:



pip install tantivy-py

Now that we have the necessary tools and libraries installed, create a virtual environment for your project and set up your preferred IDE or text editor.

FastAPI

FastAPI is a modern, high-performance web framework for building APIs with Python. It’s designed to be easy to use and has built-in support for type hints, which allows for automatic data validation, serialization, and documentation generation. FastAPI also has excellent support for asynchronous programming, which improves the performance of I/O-bound operations.

To create a FastAPI application, you’ll need to define routes, add parameters, and create request and response models. Routes are the different URLs or paths that your API can handle, while parameters are the variables passed in the URL, query string, or request body. Request and response models describe the data structure used for input and output, respectively.

Here’s an example of a FastAPI application with routes, parameters, and request/response models:



from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class UserIn(BaseModel):
    name: str
    age: int
    city: str

class UserOut(BaseModel):
    id: int
    name: str
    age: int
    city: str

users = []

@app.post("/users", response_model=UserOut)
def create_user(user: UserIn):
    user_id = len(users)
    new_user = UserOut(id=user_id, **user.dict())
    users.append(new_user)
    return new_user

@app.get("/users/{user_id}", response_model=UserOut)
def get_user(user_id: int):
    if user_id >= len(users) or user_id < 0:
        raise HTTPException(status_code=404, detail="User not found")
    return users[user_id]

@app.get("/users", response_model=list[UserOut])
def search_users(city: str = None, min_age: int = None, max_age: int = None):
    filtered_users = users

    if city:
        filtered_users = [user for user in filtered_users if user.city.lower() == city.lower()]

    if min_age:
        filtered_users = [user for user in filtered_users if user.age >= min_age]

    if max_age:
        filtered_users = [user for user in filtered_users if user.age <= max_age]

    return filtered_users

In this example, we define a FastAPI application with three routes: create_user, get_user, and search_users. We use the UserIn and UserOut classes as request and response models to validate and serialize the input/output data. We also use parameters in the URL path (e.g., user_id), query string (e.g., city, min_age, max_age), and request body (e.g., user).

Tantivy

Tantivy is a full-text search engine library written in Rust. It is designed to be fast and efficient, making it a great choice for building search engines. Tantivy provides indexing and searching capabilities, allowing you to create a schema, add documents to the index, and execute search queries.

To work with Tantivy, you’ll need to create a schema, which is a description of the structure of your documents. The schema defines the fields in each document, their data types, and any additional options or settings. Once you have a schema, you can add documents to the index, store and retrieve data, and perform searches using basic or advanced query features, such as fuzzy search, filters, and pagination.

Here’s an example of creating a schema, indexing documents, and performing basic and advanced searches using Tantivy:



from tantivy import Collector, Index, QueryParser, SchemaBuilder, Term

# Create a schema
schema_builder = SchemaBuilder()
title_field = schema_builder.add_text_field("title", stored=True)
body_field = schema_builder.add_text_field("body", stored=True)
schema = schema_builder.build()

# Create an index with the schema
index = Index(schema)

# Add documents to the index
with index.writer() as writer:
    writer.add_document({"title": "First document", "body": "This is the first document."})
    writer.add_document({"title": "Second document", "body": "This is the second document."})
    writer.commit()

# Create a query parser
query_parser = QueryParser(schema, ["title", "body"])

# Basic search
query = query_parser.parse_query("first")
collector = Collector.top_docs(10)
search_result = index.searcher().search(query, collector)

print("Basic search results:")
for doc in search_result.docs():
    print(doc)

# Fuzzy search
fuzzy_query = query_parser.parse_query("frst~1")  # Allows one edit distance
fuzzy_collector = Collector.top_docs(10)
fuzzy_search_result = index.searcher().search(fuzzy_query, fuzzy_collector)

print("Fuzzy search results:")
for doc in fuzzy_search_result.docs():
    print(doc)

# Filtered search
title_term = Term(title_field, "first")
body_term = Term(body_field, "first")
filter_query = schema.new_boolean_query().add_term(title_term).add_term(body_term)
filtered_collector = Collector.top_docs(10)
filtered_search_result = index.searcher().search(filter_query, filtered_collector)

print("Filtered search results:")
for doc in filtered_search_result.docs():
    print(doc)

In this example, we first create a schema with two text fields: “title” and “body”. Then, we create an index and add documents to it. We also create a query parser to parse queries for searching the index. We demonstrate basic search, fuzzy search (with a specified edit distance), and filtered search (using boolean queries to combine terms).

Building the Search Engine

Now that we have an understanding of FastAPI and Tantivy, it’s time to build our search engine. We’ll start by designing the search engine architecture, which includes the FastAPI application and the Tantivy index.

First, create a FastAPI application by defining search and indexing endpoints. These endpoints will be responsible for processing search queries and indexing new documents, respectively. Implement request and response models for each endpoint to describe the data structure used for input and output.



from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from tantivy import Collector, Index, QueryParser, SchemaBuilder

app = FastAPI()

# Create a schema
schema_builder = SchemaBuilder()
title_field = schema_builder.add_text_field("title", stored=True)
body_field = schema_builder.add_text_field("body", stored=True)
schema = schema_builder.build()

# Create an index with the schema
index = Index(schema)

# Create a query parser
query_parser = QueryParser(schema, ["title", "body"])

class DocumentIn(BaseModel):
    title: str
    body: str

class DocumentOut(BaseModel):
    title: str
    body: str

@app.post("/index", response_model=None)
def index_document(document: DocumentIn):
    with index.writer() as writer:
        writer.add_document(document.dict())
        writer.commit()

@app.get("/search", response_model=list[DocumentOut])
def search_documents(q: str):
    query = query_parser.parse_query(q)
    collector = Collector.top_docs(10)
    search_result = index.searcher().search(query, collector)

    documents = [DocumentOut(**doc) for doc in search_result.docs()]
    return documents

In this example, we create a FastAPI application with two endpoints: index_document and search_documents. The index_document endpoint is responsible for indexing new documents, while the search_documents endpoint is responsible for processing search queries. We use the DocumentIn and DocumentOut classes as request and response models to describe the data structure for input and output.

Next, index documents using Tantivy. Write a function that processes and stores data in the Tantivy index. This function should take input data, create a document based on the schema, and add it to the index.



from typing import Dict
from tantivy import Index, SchemaBuilder

# Create a schema
schema_builder = SchemaBuilder()
title_field = schema_builder.add_text_field("title", stored=True)
body_field = schema_builder.add_text_field("body", stored=True)
schema = schema_builder.build()

# Create an index with the schema
index = Index(schema)

def index_document(document_data: Dict[str, str]) -> None:
    with index.writer() as writer:
        writer.add_document(document_data)
        writer.commit()

# Example usage
document = {"title": "Example document", "body": "This is an example document."}
index_document(document)

In this example, we define a function called index_document that takes a dictionary as input data, representing a document to be indexed. This function creates a document based on the schema and adds it to the Tantivy index. The example usage demonstrates how to use the function to index a sample document.

Finally, implement the search functionality. Use Tantivy to execute search queries, and handle search results by processing and returning them in a format that can be easily consumed by the client.



rom typing import Dict, List
from tantivy import Collector, Index, QueryParser, SchemaBuilder

# Create a schema
schema_builder = SchemaBuilder()
title_field = schema_builder.add_text_field("title", stored=True)
body_field = schema_builder.add_text_field("body", stored=True)
schema = schema_builder.build()

# Create an index with the schema
index = Index(schema)

# Create a query parser
query_parser = QueryParser(schema, ["title", "body"])

def search_documents(query_str: str) -> List[Dict[str, str]]:
    query = query_parser.parse_query(query_str)
    collector = Collector.top_docs(10)
    search_result = index.searcher().search(query, collector)

    documents = [doc.as_json() for doc in search_result.docs()]
    return documents

# Example usage
search_query = "example"
results = search_documents(search_query)
print(f"Search results for '{search_query}':")
for doc in results:
    print(doc)

In this example, we define a function called search_documents that takes a query string as input and uses Tantivy to execute the search query. The function processes the search results by converting each result document to a dictionary and returns a list of dictionaries that can be easily consumed by the client. The example usage demonstrates how to use the function to perform a search and display the results.

Testing and Optimization

To ensure that your search engine works correctly and efficiently, write unit tests for the FastAPI and Tantivy components. Test the functionality of each endpoint and the proper interaction between FastAPI and Tantivy. Additionally, benchmark your search engine to assess its performance and identify any bottlenecks or areas for improvement. Optimize your code by addressing these bottlenecks and making any necessary adjustments.

Here’s an example of unit tests for the FastAPI and Tantivy components using pytest and httpx libraries:



import pytest
import httpx
from fastapi import FastAPI
from fastapi.testclient import TestClient
from .main import app, index_document, search_documents

client = TestClient(app)

# Test FastAPI endpoints
def test_index_document():
    response = client.post("/index", json={"title": "Test document", "body": "This is a test document."})
    assert response.status_code == 200

def test_search_documents():
    response = client.get("/search", params={"q": "test"})
    assert response.status_code == 200
    assert len(response.json()) > 0
    assert response.json()[0]["title"] == "Test document"

# Test Tantivy components
def test_tantivy_index_document():
    document = {"title": "Tantivy test document", "body": "This is a Tantivy test document."}
    index_document(document)

def test_tantivy_search_documents():
    search_query = "tantivy"
    results = search_documents(search_query)
    assert len(results) > 0
    assert results[0]["title"] == "Tantivy test document"

# Performance benchmarking and optimization can be done using profiling tools,
# such as cProfile, Py-Spy, or others, depending on the specific bottlenecks and areas
# for improvement identified in your application.

To benchmark the search engine and identify bottlenecks, you can use profiling tools, such as cProfile or Py-Spy. Once you’ve identified the areas for improvement, optimize your code by addressing the bottlenecks and making necessary adjustments. Performance optimization is an iterative process and may require multiple rounds of profiling and optimization.

Deploying the Search Engine

Once your search engine is fully functional and optimized, it’s time to deploy it. Choose a deployment platform that best suits your needs, such as a cloud provider or a dedicated server. Configure the deployment environment by setting up necessary components, such as a web server, application server, and any required databases or storage systems.

After deployment, monitor and maintain your search engine to ensure its smooth operation. Keep an eye on performance metrics, such as response times and resource utilization, and address any issues that arise.

In this article, we’ve explored the process of building a search engine from scratch using FastAPI and Tantivy. We’ve covered the fundamentals of both FastAPI and Tantivy, as well as the steps needed to create, test, optimize, and deploy a custom search engine. By following this guide, you should now have a working search engine that can be tailored to your specific needs.

The possibilities with this custom search engine are vast, and you can extend its functionality to accommodate various applications, such as site search, document search, or even powering a custom search service. As you continue to experiment and explore, you’ll discover the true power and flexibility of using FastAPI and Tantivy to create search solutions that meet your unique requirements.

DEV Community