Add Natural Language Understanding to any application
Search is the foundation of many applications. Once data starts to pile up, users want to be able to find it. It's the backbone of the internet and an ever-growing challenge that is never fully solved or done.
The field of Natural Language Processing (NLP) is rapidly evolving, with a number of new developments. Large-scale general language models are an exciting new capability, allowing us to add amazing functionality quickly with limited compute and limited people. Innovation continues, with new models and advancements arriving on what seems like a weekly basis.
This article introduces txtai, an AI-powered search engine that enables Natural Language Understanding (NLU) based search in any application.
AI-powered search engine
txtai builds an AI-powered index over sections of text. txtai supports building text indices to perform similarity searches and create extractive question-answering based systems. txtai also has functionality for zero-shot classification. txtai is open source and available on GitHub.
NeuML uses txtai and/or the concepts behind it to power all of our Natural Language Processing (NLP) applications. Example applications:
- paperai - AI-powered literature discovery and review engine for medical/scientific papers
- tldrstory - AI-powered understanding of headlines and story text
- neuspo - Fact-driven, real-time sports event and news site
- codequestion - Ask coding questions directly from the terminal
txtai is built on the following stack:
- Hugging Face Transformers
- Faiss, Annoy and Hnswlib (Approximate Nearest Neighbor search)
- Python 3.6+
The easiest way to install is via pip and PyPI
pip install txtai
You can also install txtai directly from GitHub. Using a Python Virtual Environment is recommended.
pip install git+https://github.com/neuml/txtai
Python 3.6+ is supported
Windows and macOS systems have the following prerequisites. No additional…
With txtai installed, we can create a simple in-memory model with a couple of sample records to try it out.
Running the code above prints the best matching section for each query. Notice that for almost all of the queries, the query text doesn't literally appear in the matched text sections. This is the true power of transformer models over token-based search. What you get out of the box is 🔥🔥🔥!
For small lists of texts, the method above works. But for larger repositories of documents, it doesn't make sense to tokenize and vectorize every record on each query. txtai supports building pre-computed indices, which significantly improves performance.
Building on the previous example, the following example runs an index method to build and store the text embeddings. In this case, only the query is converted to an embeddings vector on each search.
Once again, the same results are returned; the only difference is that the embeddings are pre-computed.
Embeddings indices can be saved to disk and reloaded. At this time, indices are not incrementally created; the index needs a full rebuild to incorporate new data.
The results of the code above:
Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg
With a limited amount of code, we’re able to build a system with a deep understanding of natural language. The amount of knowledge that comes from Transformer models is phenomenal.
txtai builds sentence embeddings to perform similarity searches. txtai takes each text record entry, tokenizes it and builds an embeddings representation of that record. At search time, the query is transformed into a text embedding and then is compared to the repository of text embeddings.
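The mechanics can be sketched without any ML dependencies; here toy bag-of-words count vectors stand in for real transformer embeddings:

```python
import math
from collections import Counter

def embed(text):
    # Stand-in for a real sentence embedding: a bag-of-words count vector
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors
    dot = sum(a[token] * b[token] for token in a)
    norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm(a) * norm(b)) if a and b else 0.0

# "Repository" of text records, each converted to an embedding up front
sections = ["the ice shelf suddenly collapsed", "man wins lottery ticket prize"]
index = [embed(section) for section in sections]

# At search time, the query is embedded and compared to every record
query = embed("collapsing ice shelf")
best = max(range(len(index)), key=lambda i: cosine(query, index[i]))
print(sections[best])
```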
txtai supports two methods for creating text embeddings, sentence transformers and word embeddings vectors. Both methods have their merits as shown below.
Sentence Transformers
- Creates a single embeddings vector via mean pooling of vectors generated by the Transformers library.
- Supports models stored on Hugging Face’s model hub or stored locally.
- See Sentence Transformers for details on how to create custom models, which can be kept local or uploaded to Hugging Face’s model hub.
- Base models require significant compute capability (GPU preferred). Possible to build smaller/lighter weight models that trade off accuracy for speed.
Word Embeddings
- Creates a single embeddings vector via a BM25-weighted average of the word vectors for each word component.
- Backed by the pymagnitude library. Pre-trained word vectors can be installed from the referenced link.
- See vectors.py for code that can build word vectors for custom datasets.
- Significantly better speed with default models. For larger datasets, it offers a good trade off of speed and accuracy.
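The word-vector method can be sketched with toy data; the 2-d vectors and per-word weights below are made up, standing in for pymagnitude word vectors and real BM25 scores:

```python
# Toy pre-trained word vectors (2-dimensional for illustration)
vectors = {"ice": [0.9, 0.1], "shelf": [0.8, 0.2], "lottery": [0.1, 0.9]}

# Per-word weights standing in for BM25 scores; rarer terms weigh more
weights = {"ice": 2.0, "shelf": 1.5, "lottery": 1.0}

def sentence_vector(tokens, dims=2):
    # Weighted average of word vectors: words with higher BM25-style
    # scores contribute more to the single sentence embedding
    total = sum(weights.get(token, 1.0) for token in tokens)
    vector = [0.0] * dims
    for token in tokens:
        weight = weights.get(token, 1.0)
        for i, value in enumerate(vectors.get(token, [0.0] * dims)):
            vector[i] += weight * value / total
    return vector

print(sentence_vector(["ice", "shelf"]))
```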
As discussed above, txtai uses similarity search to compare a sentence embedding against all sentence embeddings in the repository. The first question that may come to mind is: how does that scale to millions or billions of records? The answer is Approximate Nearest Neighbor (ANN) search. ANN enables efficient execution of similarity queries over a large corpus of data.
A number of robust libraries are available in Python that enable ANN search. txtai has a configurable index backend that allows plugging in different ANN libraries. At this time, txtai supports:
- Faiss
- Annoy
- Hnswlib
txtai uses sensible default settings for each of the libraries above to make it as easy as possible to get up and running. Index selection is also abstracted away by default, with the backend chosen based on the target environment.
The libraries above either don’t have a method for associating embeddings with record ids or assume the id is an integer. txtai takes care of that and keeps an internal id mapping, which allows any id type.
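That id mapping can be sketched in a few lines; a brute-force search stands in for a real ANN backend, and the class and ids below are hypothetical:

```python
class IdMappedIndex:
    """Wraps a positional vector index with an external-id mapping."""

    def __init__(self):
        self.ids = []      # position -> external id (any type)
        self.vectors = []  # position -> embedding vector

    def add(self, uid, vector):
        self.ids.append(uid)
        self.vectors.append(vector)

    def search(self, vector, limit=1):
        # Brute-force stand-in for an ANN backend (Faiss, Annoy, Hnswlib),
        # which only knows about integer positions
        def dist(v):
            return sum((a - b) ** 2 for a, b in zip(v, vector))
        order = sorted(range(len(self.vectors)),
                       key=lambda i: dist(self.vectors[i]))
        # Map internal positions back to external ids
        return [self.ids[i] for i in order[:limit]]

index = IdMappedIndex()
index.add("doc-a", [0.0, 1.0])
index.add("doc-b", [1.0, 0.0])
print(index.search([0.9, 0.1]))  # ['doc-b']
```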
Benchmarks for each of the supported systems (and others) can help guide which ANN library is the best fit for a given dataset. There are also platform differences; for example, Faiss is only supported on Linux and macOS.
In addition to similarity search, txtai supports extractive question-answering over returned results. This powerful feature enables asking a further series of questions over a list of search results.
An example use case of this is with the CORD-19 challenge on Kaggle. This effort required creating summary tables for a series of medical queries, extracting additional columns for each result.
The following shows how to create an Extractive QA component within txtai.
The next step is to load a set of results to ask questions about. The following example has text snippets with sports scores covering a series of games.
Results for the section above.
We can see the extractor was able to understand the context of the sections above and answer related questions. The Extractor component can work with a txtai Embeddings index as well as with external data stores. This modularity allows us to pick and choose what functionality to use from txtai to create natural language aware search systems.
More detailed examples and use cases for txtai can be found in the following notebooks.
NLP is advancing at a rapid pace, and things that weren't possible even a year ago now are. This article introduced txtai, an AI-powered search engine that enables quick integration of robust models with a deep understanding of natural language. Hugging Face's model hub has a number of base and community-provided models that can be used to customize search for almost any dataset. The possibilities are limitless and we're excited to see what can be built on top of txtai!