Tutorial series on txtai

David Mezzetti (NeuML)


This tutorial series covers the main use cases for txtai, an AI-powered search engine. Each part in the series has a corresponding notebook that fully reproduces the article.

Introducing txtai

txtai builds an AI-powered index over sections of text. It supports similarity searches, extractive question-answering systems and zero-shot classification.

NeuML uses txtai and the concepts behind it to power all of its Natural Language Processing (NLP) applications. Example applications:

  • paperai - AI-powered literature discovery and review engine for medical/scientific papers
  • tldrstory - AI-powered understanding of headlines and story text
  • neuspo - Fact-driven, real-time sports event and news site
  • codequestion - Ask coding questions directly from the terminal

txtai is built on the following stack:

  • sentence-transformers
  • transformers
  • faiss
  • Python 3.6+

This article gives an overview of txtai and how to run similarity searches.

Install dependencies

Install txtai and all dependencies.

pip install txtai

Create an Embeddings instance

The Embeddings instance is the main entry point for txtai. It defines the method used to tokenize and convert a text section into an embeddings vector.

from txtai.embeddings import Embeddings

# Create embeddings model, backed by sentence-transformers & transformers
embeddings = Embeddings({"method": "transformers", "path": "sentence-transformers/bert-base-nli-mean-tokens"})

Running similarity queries

An Embeddings instance relies on the underlying transformer model to build text embeddings. The following example shows how to use a transformers-backed Embeddings instance to run similarity searches for a list of different concepts.

data = ["US tops 5 million confirmed virus cases",
        "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
        "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
        "The National Park Service warns against sacrificing slower friends in a bear attack",
        "Maine man wins $1M from $25 lottery ticket",
        "Make huge profits without work, earn up to $100,000 a day"]

print("%-20s %s" % ("Query", "Best Match"))
print("-" * 50)

for query in ("feel good story", "climate change", "health", "war", "wildlife", "asia", "north america", "dishonest junk"):
    # Get index of the text section that best matches the query
    uid = embeddings.similarity(query, data)[0][0]

    print("%-20s %s" % (query, data[uid]))
Query                Best Match
--------------------------------------------------
feel good story      Maine man wins $1M from $25 lottery ticket
climate change       Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg
health               US tops 5 million confirmed virus cases
war                  Beijing mobilises invasion craft along coast as Taiwan tensions escalate
wildlife             The National Park Service warns against sacrificing slower friends in a bear attack
asia                 Beijing mobilises invasion craft along coast as Taiwan tensions escalate
north america        US tops 5 million confirmed virus cases
dishonest junk       Make huge profits without work, earn up to $100,000 a day

The example above shows that for almost all of the queries, the query text doesn't appear verbatim in the matched section. This is the true power of transformer models over token-based search. What you get out of the box is 🔥🔥🔥!
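
Note that embeddings.similarity returns (id, score) pairs sorted by descending score, which is what the [0][0] indexing above relies on. A minimal sketch that prints the full ranking for a single query:

# Print the full ranking for one query
# Higher scores indicate a closer match
for uid, score in embeddings.similarity("climate change", data):
    print("%.4f %s" % (score, data[uid]))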

Building an Embeddings index

For small lists of texts, the method above works. But for larger repositories of documents, it doesn't make sense to tokenize and convert every text section to embeddings on each query. txtai supports building pre-computed indices which significantly improve performance.

Building on the previous example, the following example runs an index method to build and store the text embeddings. In this case, only the query is converted to an embeddings vector at search time.

# Create an index for the list of text
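# Each entry is an (id, text, tags) tuple; tags aren't used here, hence None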
embeddings.index([(uid, text, None) for uid, text in enumerate(data)])

print("%-20s %s" % ("Query", "Best Match"))
print("-" * 50)

# Run an embeddings search for each query
for query in ("feel good story", "climate change", "health", "war", "wildlife", "asia", "north america", "dishonest junk"):
    # Extract uid of first result
    # search result format: (uid, score)
    uid = embeddings.search(query, 1)[0][0]

    # Print text
    print("%-20s %s" % (query, data[uid]))
Query                Best Match
--------------------------------------------------
feel good story      Maine man wins $1M from $25 lottery ticket
climate change       Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg
health               US tops 5 million confirmed virus cases
war                  Beijing mobilises invasion craft along coast as Taiwan tensions escalate
wildlife             The National Park Service warns against sacrificing slower friends in a bear attack
asia                 Beijing mobilises invasion craft along coast as Taiwan tensions escalate
north america        US tops 5 million confirmed virus cases
dishonest junk       Make huge profits without work, earn up to $100,000 a day
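
The second argument to search is the number of results to return. The sketch below pulls the top three matches for a single query, again assuming the (id, score) result format noted above:

# Retrieve the top 3 matches for a single query
for uid, score in embeddings.search("asia", 3):
    print("%.4f %s" % (score, data[uid]))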

Embeddings load/save

Embeddings indices can be saved to disk and reloaded. At this time, indices are not incrementally created; the index needs a full rebuild to incorporate new data. But that enhancement is in the backlog.

embeddings.save("index")

embeddings = Embeddings()
embeddings.load("index")

uid = embeddings.search("climate change", 1)[0][0]
print(data[uid])
Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg
ls index
config  embeddings

Embeddings update/delete

Updates and deletes are supported for Embeddings indices. The upsert operation will insert new data and update existing data.

The following section runs a query, then updates a value, changing the top result, and finally deletes the updated value to revert to the original query results.

# Run initial query
uid = embeddings.search("feel good story", 1)[0][0]
print("Initial: ", data[uid])

# Update data
data[0] = "Feel good story: baby panda born"
embeddings.upsert([(0, data[0], None)])

uid = embeddings.search("feel good story", 1)[0][0]
print("After update: ", data[uid])

# Remove record just added from index
embeddings.delete([0])

# Ensure value matches previous value
uid = embeddings.search("feel good story", 1)[0][0]
print("After delete: ", data[uid])
Initial:  Maine man wins $1M from $25 lottery ticket
After update:  Feel good story: baby panda born
After delete:  Maine man wins $1M from $25 lottery ticket

Embedding methods

Embeddings supports two methods for creating text vectors: the sentence-transformers library and word embeddings vectors. Both methods have their merits, as shown below; a configuration sketch for each follows the list.

  • sentence-transformers
    • Creates a single embeddings vector via mean pooling of vectors generated by the transformers library.
    • Supports models stored on Hugging Face's model hub or stored locally.
    • See sentence-transformers for details on how to create custom models, which can be kept local or uploaded to Hugging Face's model hub.
    • Base models require significant compute capability (GPU preferred). It's possible to build smaller, lighter-weight models that trade accuracy for speed.
  • word embeddings
    • Creates a single embeddings vector via BM25 scoring of each word component. See this Medium article for the logic behind this method.
    • Backed by the pymagnitude library. Pre-trained word vectors can be installed from the referenced link.
    • See words.py for code that can build word vectors for custom datasets.
    • Significantly faster than transformer models with default settings. For larger datasets, this method offers a good tradeoff of speed and accuracy.
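
Here is a minimal configuration sketch for both methods. The transformers config matches the one used earlier in this article; the "words" method name and the .magnitude vector path are assumptions based on the pymagnitude backing described above, so substitute a real pre-trained or custom-built vector file.

from txtai.embeddings import Embeddings

# Transformers-backed model (same config used earlier in this article)
embeddings = Embeddings({"method": "transformers",
                         "path": "sentence-transformers/bert-base-nli-mean-tokens"})

# Word embeddings model backed by pymagnitude
# The "words" method name and vectors.magnitude path are assumptions;
# point path at a pre-trained or custom-built .magnitude vector file
embeddings = Embeddings({"method": "words", "path": "vectors.magnitude"})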
