Run machine-learning workflows to transform data and build AI-powered text indices with txtai

David Mezzetti
Founder/CEO at NeuML — applying machine learning to solve everyday problems. Previously co-founded and built Data Works into a successful IT services company.

txtai workflows

txtai executes machine-learning workflows to transform data and build AI-powered text indices to perform similarity search. txtai supports indexing text snippets, documents, audio and images. Pipelines and workflows enable transforming data with machine-learning models. An introduction to txtai is available in the article below.

Introducing txtai, an AI-powered search engine built on Transformers

Since the initial release of txtai back in August 2020, txtai has grown considerably. In addition to building embedding indices, txtai now supports transformations to prepare data for indexing through pipelines, workflows to join pipelines together, API bindings for JavaScript/Java/Rust/Go and the ability to scale out processing. This article will cover methods to vectorize data, machine-learning pipelines and workflows.

Vectorize data

txtai initially supported building indices over sections of text. txtai now supports documents, audio and images. Documents and audio will be shown below in the pipelines sections. This section will show how to vectorize images and run a similarity search.

sentence-transformers recently added support for the OpenAI CLIP model. This model embeds text and images into the same space, enabling image similarity search. txtai can directly utilize these models.

The code above builds a similarity index of a directory of images and searches using a query. Run it against your own images and explore the results!
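To make the idea of a shared embedding space more concrete, here is a minimal sketch in plain Python. It is not txtai code and does not load CLIP: the three-dimensional vectors, the file names and the `search` helper are all made up for illustration. In a real setup a model would produce the vectors for both the images and the text query.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(index, query, limit=1):
    """Return the top `limit` (id, score) pairs, best match first."""
    scores = [(uid, cosine(vector, query)) for uid, vector in index.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:limit]

# Hypothetical "image" vectors; a real model would generate these
image_index = {
    "beach.jpg": [0.9, 0.1, 0.0],
    "forest.jpg": [0.1, 0.9, 0.1],
    "city.jpg": [0.0, 0.2, 0.9],
}

# Hypothetical vector for a text query, embedded into the same space
query_vector = [0.2, 0.8, 0.0]

results = search(image_index, query_vector, limit=2)
print(results)
```

Because text and images live in the same vector space, ranking images against a text query reduces to comparing vectors, which is all this toy `search` does.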


Pipelines

txtai has a growing list of models available through its pipeline framework. Pipelines wrap a machine-learning model and transform data. Currently, pipelines can wrap Hugging Face Transformers models, Hugging Face Transformers pipelines or PyTorch models (support for TensorFlow is in the backlog).

The following is a list of the currently implemented pipelines.

  • Questions - Answer questions using a text context

  • Labels - Apply labels to text using a zero-shot classification model. Also supports similarity comparisons.

  • Summary - Abstractive text summarization

  • Textractor - Extract text from documents

  • Transcription - Transcribe audio to text

  • Translation - Machine translation

Pipelines take input data, apply NLP transformations and return results. The following notebooks go through examples of each of the pipelines above.
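As a rough picture of the pattern, a pipeline can be thought of as a callable object that wraps a model and maps inputs to outputs. The sketch below is plain Python, not txtai's implementation: the `Pipeline` class here and the `first_sentence` "model" are hypothetical stand-ins, and accepting either a single item or a list is an assumption made for the example.

```python
class Pipeline:
    """Toy wrapper: applies a model-like callable to one item or a list."""

    def __init__(self, model):
        self.model = model

    def __call__(self, data):
        # Accept a single input or a list of inputs
        if isinstance(data, list):
            return [self.model(item) for item in data]
        return self.model(data)

def first_sentence(text):
    """Hypothetical stand-in for a summarization model: keep the first sentence."""
    return text.split(". ")[0].rstrip(".")

summary = Pipeline(first_sentence)

# Single input returns a single result
single = summary("Search is the base of many applications. Users want to find data.")

# A list of inputs returns a list of results
batch = summary(["One. Two.", "Three. Four."])

print(single)
print(batch)
```

The point of the wrapper is that callers only see "data in, results out", regardless of which model sits behind it.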

Abstractive Summarization

Abstractive summarization uses Natural Language Processing (NLP) models to build transformative summaries of text. This is similar to having a human read an article and then asking what it was about. A human wouldn't just recite the text verbatim. Let’s look at an example.

The section above prints:

Search is the foundation of the internet

A full example can be found in the notebook linked below.

Build abstractive text summaries

Text Extraction

This section shows how to extract text from documents in a form that best supports similarity search.

The section above prints:

Introducing txtai, an AI-powered search engine built on Transformers Add Natural Language Understanding to any application Search is the base of many applications. Once data starts to pile up, users want to be able to find it. It’s the foundation.....

A full example can be found in the notebook linked below. This example shows how text can be split/segmented to assist with building sections of text to index.

Extract text from documents
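For a sense of what splitting/segmenting extracted text can look like, here is a small sketch using only the Python standard library. It is not how Textractor works internally: the `sentences` and `paragraphs` helpers, the regular expressions and the sample text are all assumptions chosen for illustration, and real documents need more careful handling (abbreviations, headings and so on).

```python
import re

def sentences(text):
    """Naive sentence split: break after ., ! or ? followed by whitespace."""
    normalized = " ".join(text.split())
    parts = re.split(r"(?<=[.!?])\s+", normalized)
    return [part for part in parts if part]

def paragraphs(text):
    """Split on blank lines and collapse whitespace within each paragraph."""
    blocks = re.split(r"\n\s*\n", text)
    return [" ".join(block.split()) for block in blocks if block.strip()]

# Made-up sample text standing in for text extracted from a document
document = """txtai builds text indices. It also runs
machine-learning pipelines.

Workflows join pipelines together."""

print(sentences(document))
print(paragraphs(document))
```

Each returned sentence or paragraph could then be indexed as its own section of text, which tends to give more focused search results than indexing a whole document as one entry.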

Audio Transcription

Hugging Face Transformers provides a number of models that can perform audio transcription (audio to text).

The section above prints:

Make huge profits without working make up to one hundred thousand dollars a day

A full example can be found in the notebook linked below.

Transcribe audio to text

Translate text between languages

This section covers machine translation backed by Hugging Face Transformers models. Machine translation via cloud services has come a very long way and now produces high-quality results. The following shows how local models can give developers a reasonable alternative.

The section above prints:

Esta es una traducción de prueba al español

A full example can be found in the notebook linked below.

Translate text between languages


Workflows

Pipelines are great and make using a variety of machine-learning models easier. But what if we want to glue the results of different pipelines together? For example: extract text, summarize it, translate it to English and load it into an embeddings index. That would require code to join those operations together efficiently.

Enter workflows. Workflows are a simple yet powerful construct that takes a callable and returns elements. Workflows don’t know they are working with pipelines but enable efficient processing of pipeline data. Workflows are streaming by nature and work on data in batches, allowing large volumes of data to be processed efficiently.

The example above transcribes audio to text then translates the text to French.

["Les cas de virus U sont en tête d'un million",
 "La dernière plate-forme de glace entièrement intacte du Canada s'est soudainement effondrée en formant un berge de glace de taille manhatten",
 "Bagage mobilise les embarcations d'invasion le long des côtes à mesure que les tensions tiwaniennes s'intensifient",
 "Le service des parcs nationaux met en garde contre le sacrifice d'amis plus lents dans une attaque nue",
 "L'homme principal gagne du billet de loterie",
 "Faire d'énormes profits sans travailler faire jusqu'à cent mille dollars par jour"]

This example and additional examples including a complex workflow that summarizes text, translates the text to French and then builds an Embedding index can be found in the notebook below.

Run pipeline workflows
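The streaming, batch-at-a-time behavior described in this section can be sketched in a few lines of plain Python. This is a toy, not txtai's workflow implementation: `ToyWorkflow`, the `batch` parameter and the `transcribe`/`translate` functions are hypothetical stand-ins that only mimic the shape of the idea (callables that take a list of elements and return a list).

```python
from itertools import islice

class ToyWorkflow:
    """Runs a chain of callables over an input stream, one batch at a time."""

    def __init__(self, tasks, batch=2):
        self.tasks = tasks
        self.batch = batch

    def __call__(self, elements):
        iterator = iter(elements)
        while True:
            # Pull the next batch without loading the whole stream
            chunk = list(islice(iterator, self.batch))
            if not chunk:
                break

            # Each task takes a list of elements and returns a list
            for task in self.tasks:
                chunk = task(chunk)

            # Stream results back to the caller
            yield from chunk

def transcribe(files):
    """Hypothetical stand-in for an audio transcription pipeline."""
    return [f"text of {name}" for name in files]

def translate(texts):
    """Hypothetical stand-in for a translation pipeline."""
    return [f"[fr] {text}" for text in texts]

workflow = ToyWorkflow([transcribe, translate], batch=2)
results = list(workflow(["a.wav", "b.wav", "c.wav"]))
print(results)
```

The workflow never inspects what the callables do, so any function with the same list-in, list-out shape can be chained. Because it is a generator, only one batch is held in memory at a time.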

Wrapping up

All of the functionality discussed is now available in the main branch on GitHub and will be in the upcoming v3.0 release. txtai continues to rapidly evolve and there will be a continued focus on adding new pipelines. The ability to horizontally scale out at the pipeline and workflow level is also a continuing area of development.

The goal for txtai is to be simple enough to work on a laptop but able to scale out to clustered/cloud systems.
