
David Mezzetti for NeuML

Originally published at neuml.hashnode.dev

Build a QA database

This article is part of a tutorial series on txtai, an AI-powered semantic search platform.

txtai executes machine-learning workflows to transform data and build AI-powered semantic search applications.

Conversational AI is a growing field that could automate much of the customer service industry. Full automation is still a long way off (most of us have been on a call with an automated agent and just wanted to reach a person), but it can be a solid first line before human intervention.

This article presents a process to answer user questions using a txtai embeddings instance. It's not conversational AI; instead, it finds the closest existing question to a user's question. This is useful when there is a list of frequently asked questions.

Install dependencies

Install txtai and all dependencies.

pip install txtai datasets

Load the dataset

We'll use a Hugging Face dataset of web questions for this example. The dataset has a list of questions and answers. The code below loads the dataset and prints the first five rows to show how the data is formatted.

from datasets import load_dataset

ds = load_dataset("web_questions", split="train")

for row in ds.select(range(5)):
  print(row["question"], row["answers"])
what is the name of justin bieber brother? ['Jazmyn Bieber', 'Jaxon Bieber']
what character did natalie portman play in star wars? ['Padmé Amidala']
what state does selena gomez? ['New York City']
what country is the grand bahama island in? ['Bahamas']
what kind of money to take to bahamas? ['Bahamian dollar']

Create index

Next, we'll create a txtai index. The question will be the indexed text. We'll also store full content so we can access the answer at query time.

from txtai.embeddings import Embeddings

# Create embeddings index with content enabled. The default behavior is to only store indexed vectors.
embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2", "content": True})

# Map question to text and store content
embeddings.index([(uid, {"url": row["url"], "text": row["question"], "answer": ", ".join(row["answers"])}, None) for uid, row in enumerate(ds)])
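The list comprehension above packs three things into each element: an id, a dictionary of fields, and tags (here `None`). As a more readable sketch of the same mapping, the hypothetical helper `to_documents` below (not part of txtai) yields those tuples one at a time. It assumes each row has the `url`, `question` and `answers` fields that the comprehension relies on, and the two sample rows are made up for illustration.

```python
def to_documents(rows):
    """Yield (id, data, tags) tuples in the shape passed to embeddings.index above."""
    for uid, row in enumerate(rows):
        yield (
            uid,
            {
                "url": row["url"],
                # "text" holds the question, which is the indexed text
                "text": row["question"],
                # Flatten the list of answers into one display string
                "answer": ", ".join(row["answers"]),
            },
            None,  # no tags
        )


# Two made-up rows in the dataset's format, for illustration only
sample = [
    {
        "url": "http://example.com/1",
        "question": "what country is the grand bahama island in?",
        "answers": ["Bahamas"],
    },
    {
        "url": "http://example.com/2",
        "question": "what is the name of justin bieber brother?",
        "answers": ["Jazmyn Bieber", "Jaxon Bieber"],
    },
]

documents = list(to_documents(sample))
print(documents[1][1]["answer"])
```

With the real dataset, `embeddings.index(to_documents(ds))` should index the same content as the comprehension above.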

Asking questions

Now that the index is built, let's ask some questions! We'll use txtai SQL to select the fields we want to return.

Below are the questions asked, each followed by the best matching question-answer pair.

def question(text):
  return embeddings.search(f"select text, answer, score from txtai where similar('{text}') limit 1")

question("What is the timezone of NYC?")
[{'answer': 'North American Eastern Time Zone',
  'score': 0.8904051184654236,
  'text': 'what time zone is new york under?'}]
question("Things to do in New York")
[{'answer': "Chelsea Art Museum, Brooklyn Bridge, Empire State Building, The Broadway Theatre, American Museum of Natural History, Central Park, St. Patrick's Cathedral, Japan Society of New York, FusionArts Museum, American Folk Art Museum",
  'score': 0.8308358192443848,
  'text': 'what are some places to visit in new york?'}]
question("Microsoft founder")
[{'answer': 'Bill Gates',
  'score': 0.6617322564125061,
  'text': 'who created microsoft windows?'}]
question("Apple founder university")
[{'answer': 'Reed College',
  'score': 0.5137897729873657,
  'text': 'what college did steve jobs attend?'}]
question("What country uses the Yen?")
[{'answer': 'Japanese yen',
  'score': 0.6663530468940735,
  'text': 'what money do japanese use?'}]
question("Show me a list of Pixar movies")
[{'answer': "A Bug's Life, Toy Story 2, Ratatouille, Cars, Up, Toy Story, Monsters, Inc., The Incredibles, Finding Nemo, WALL-E",
  'score': 0.653051495552063,
  'text': 'what does pixar produce?'}]
question("What is the timezone of Florida?")
[{'answer': 'North American Eastern Time Zone',
  'score': 0.9672279357910156,
  'text': 'where is the time zone in florida?'}]
question("Tell me an animal found offshore in Florida")
[{'answer': 'Largemouth bass',
  'score': 0.6526554822921753,
  'text': 'what kind of fish do you catch in florida?'}]

Not too bad! The index only holds the train split of the dataset, which is about 3,800 question-answer pairs. To improve quality, a score filter could be added to the query so that only high-confidence matches are returned. Still, this gives an idea of what is possible.
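The score filter mentioned above can live in the query itself. The sketch below is one way to do it, under stated assumptions: `confident_question` and the `0.6` cutoff are illustrative and not from the original, and passing the user's text through the `parameters` argument assumes a txtai release whose `search` method supports bind parameters. Binding the text instead of formatting it into the string also keeps the query from breaking when a question contains an apostrophe.

```python
def confident_question(embeddings, text, minscore=0.6):
    """Return the best match scoring at least minscore, or None.

    embeddings is expected to be an indexed txtai Embeddings instance
    with content enabled, like the one built earlier.
    """
    # minscore is coerced to float before being formatted into the SQL
    query = (
        "select text, answer, score from txtai "
        f"where similar(:q) and score >= {float(minscore)} limit 1"
    )
    # Assumption: this txtai version accepts bind parameters via `parameters`
    results = embeddings.search(query, parameters={"q": text})
    return results[0] if results else None
```

With the index built earlier, `confident_question(embeddings, "Apple founder university")` should return `None`, since that match scored about 0.51 in the results above.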

Run as an application

The same index can also be run as a txtai application, as shown below.

from txtai.app import Application

# Save index
embeddings.save("questions.tar.gz")

# Build application from the saved index
app = Application("path: questions.tar.gz")

# Run search query
app.search("select text, answer, score from txtai where similar('Tell me an animal found offshore in Florida') limit 1")[0]
{'answer': 'Largemouth bass',
 'score': 0.6526554822921753,
 'text': 'what kind of fish do you catch in florida?'}

Wrapping up

This article introduced a simple question-matching service. It could be the foundation of an automated customer service agent or an online FAQ.
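As a rough illustration of the "first line before human intervention" idea from the introduction, the sketch below turns search results into either an answer or a handoff. The `respond` function, the `0.6` cutoff and the handoff message are all hypothetical. It only assumes results shaped like the ones shown earlier (dictionaries with `text`, `answer` and `score`).

```python
HANDOFF = "I'm not sure about that one. Let me connect you with a person."


def respond(results, minscore=0.6):
    """Answer from the top result if it is confident enough, else hand off."""
    if not results or results[0]["score"] < minscore:
        return HANDOFF
    top = results[0]
    return f"{top['answer']} (matched: {top['text']})"


# Results copied from the examples earlier in the article
confident = [
    {
        "answer": "Bill Gates",
        "score": 0.6617322564125061,
        "text": "who created microsoft windows?",
    }
]
uncertain = [
    {
        "answer": "Reed College",
        "score": 0.5137897729873657,
        "text": "what college did steve jobs attend?",
    }
]

print(respond(confident))
print(respond(uncertain))
```

Where the cutoff should sit depends on the data and the model, so a value like `0.6` would need tuning against real user questions.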

For a full example, see codequestion, which is an application that matches user questions to Stack Overflow question-answer pairs.
