David Mezzetti for NeuML

Posted on May 6, 2022 • Edited on Apr 25, 2024 • Originally published at neuml.hashnode.dev

Build a QA database


Conversational AI is a growing field that could automate much of the customer service industry. Full automation is still a long way off (most of us have been on a call with an automated agent and just wanted to reach a person), but it can certainly be a solid first line before human intervention.

This article presents a process to answer user questions using a txtai embeddings instance. It's not conversational AI; instead, it finds the closest existing question to a user question and returns the stored answer. This is useful in cases where there is a list of frequently asked questions.
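To make "closest existing question" concrete, here is a minimal sketch of the matching idea. It is not how txtai works internally: txtai compares transformer sentence embeddings, while this toy version compares simple word counts with cosine similarity. The helper names (vectorize, cosine, closest) and the sample query are assumptions made up for illustration.

```python
import math
from collections import Counter


def vectorize(text):
    # Toy stand-in for an embedding: lowercase word counts (illustrative assumption)
    return Counter(text.lower().replace("?", "").split())


def cosine(a, b):
    # Cosine similarity between two sparse word-count vectors
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def closest(query, questions):
    # Return the stored question most similar to the user query
    query_vector = vectorize(query)
    return max(questions, key=lambda q: cosine(query_vector, vectorize(q)))


faq = [
    "what country is the grand bahama island in?",
    "what kind of money to take to bahamas?",
    "what character did natalie portman play in star wars?",
]

print(closest("which country is grand bahama in?", faq))
```

Word counts only match exact tokens, so paraphrases such as "NYC" and "new york" would not match here. That gap is what the sentence embedding model used later in this article is meant to close.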

Install dependencies

Install txtai and all dependencies.

pip install txtai datasets

Load the dataset

We'll use a Hugging Face dataset of web questions for this example. Each row has a question and a list of answers. The code below loads the dataset and prints the first five rows to show how the data is formatted.

from datasets import load_dataset

ds = load_dataset("web_questions", split="train")

for row in ds.select(range(5)):
  print(row["question"], row["answers"])

what is the name of justin bieber brother? ['Jazmyn Bieber', 'Jaxon Bieber']
what character did natalie portman play in star wars? ['Padmé Amidala']
what state does selena gomez? ['New York City']
what country is the grand bahama island in? ['Bahamas']
what kind of money to take to bahamas? ['Bahamian dollar']

Create index

Next, we'll create a txtai index. The question will be the indexed text. We'll also store full content so we can access the answer at query time.

from txtai.embeddings import Embeddings

# Create embeddings index with content enabled. The default behavior is to only store indexed vectors.
embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2", "content": True})

# Map question to text and store content
embeddings.index([(uid, {"url": row["url"], "text": row["question"], "answer": ", ".join(row["answers"])}, None) for uid, row in enumerate(ds)])
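The records passed to the index call above are (id, data, tags) tuples, where the "text" field carries the question to be indexed and the remaining fields are stored as content. The sketch below builds the same tuples with a generator so the shape is easier to inspect. The to_records helper and the example.com URLs are hypothetical and used only for illustration; the question and answer values come from the sample rows printed earlier.

```python
def to_records(rows):
    """Yield (id, data, tags) tuples in the same shape as the index call above."""
    for uid, row in enumerate(rows):
        data = {
            "url": row["url"],
            "text": row["question"],              # the question is the indexed text
            "answer": ", ".join(row["answers"]),  # answers flattened to one string
        }
        yield (uid, data, None)


# Sample rows; the URLs are placeholders, not values from the dataset
rows = [
    {"url": "http://example.com/1", "question": "what is the name of justin bieber brother?", "answers": ["Jazmyn Bieber", "Jaxon Bieber"]},
    {"url": "http://example.com/2", "question": "what country is the grand bahama island in?", "answers": ["Bahamas"]},
]

records = list(to_records(rows))
print(records[0])
```

A generator like this could be handed to the index call in place of the list comprehension, which avoids building the full list in memory for larger datasets.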

Asking questions

Now that the index is built, let's ask some questions! We'll use txtai SQL to select the fields we want to return.

Each query below is followed by the best matching question-answer pair and its similarity score.

def question(text):
  return embeddings.search(f"select text, answer, score from txtai where similar('{text}') limit 1")

question("What is the timezone of NYC?")

[{'answer': 'North American Eastern Time Zone',
  'score': 0.8904051184654236,
  'text': 'what time zone is new york under?'}]

question("Things to do in New York")

[{'answer': "Chelsea Art Museum, Brooklyn Bridge, Empire State Building, The Broadway Theatre, American Museum of Natural History, Central Park, St. Patrick's Cathedral, Japan Society of New York, FusionArts Museum, American Folk Art Museum",
  'score': 0.8308358192443848,
  'text': 'what are some places to visit in new york?'}]

question("Microsoft founder")

[{'answer': 'Bill Gates',
  'score': 0.6617322564125061,
  'text': 'who created microsoft windows?'}]

question("Apple founder university")

[{'answer': 'Reed College',
  'score': 0.5137897729873657,
  'text': 'what college did steve jobs attend?'}]

question("What country uses the Yen?")

[{'answer': 'Japanese yen',
  'score': 0.6663530468940735,
  'text': 'what money do japanese use?'}]

question("Show me a list of Pixar movies")

[{'answer': "A Bug's Life, Toy Story 2, Ratatouille, Cars, Up, Toy Story, Monsters, Inc., The Incredibles, Finding Nemo, WALL-E",
  'score': 0.653051495552063,
  'text': 'what does pixar produce?'}]

question("What is the timezone of Florida?")

[{'answer': 'North American Eastern Time Zone',
  'score': 0.9672279357910156,
  'text': 'where is the time zone in florida?'}]

question("Tell me an animal found offshore in Florida")

[{'answer': 'Largemouth bass',
  'score': 0.6526554822921753,
  'text': 'what kind of fish do you catch in florida?'}]

Not too bad! This database only has a little over 6,000 question-answer pairs, and several of the matches above have scores well below 0.7. To improve quality, a score filter could be added to the query so that only highly confident answers are returned. But this gives an idea of what is possible.
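One way to apply the score filter mentioned above is to post-process the search results. This is a minimal sketch: the confident helper and the 0.8 threshold are assumptions chosen for illustration, and the sample results are copied from the output shown earlier. It may also be possible to express the same cutoff in the query itself by adding a condition such as "and score >= 0.8" to the where clause; verify that against the txtai version in use.

```python
def confident(results, threshold=0.8):
    """Keep only results whose similarity score meets the threshold (0.8 is an assumed cutoff)."""
    return [result for result in results if result["score"] >= threshold]


# Two results copied from the search output above
results = [
    {"answer": "North American Eastern Time Zone", "score": 0.8904051184654236, "text": "what time zone is new york under?"},
    {"answer": "Reed College", "score": 0.5137897729873657, "text": "what college did steve jobs attend?"},
]

kept = confident(results)
print([result["text"] for result in kept])
```

When the filtered list is empty, an application could fall back to a human agent or to a "no answer found" message instead of returning a weak match.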

Run as an application

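The index can also be saved and served through a txtai application. The code below saves the index, loads it into an application, and runs the same kind of query.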

from txtai.app import Application

# Save index
embeddings.save("questions.tar.gz")

# Build application that loads the saved index
app = Application("path: questions.tar.gz")

# Run search query
app.search("select text, answer, score from txtai where similar('Tell me an animal found offshore in Florida') limit 1")[0]

{'answer': 'Largemouth bass',
 'score': 0.6526554822921753,
 'text': 'what kind of fish do you catch in florida?'}

Wrapping up

This article introduced a simple question matching service. This could be the foundation of an automated customer service agent and/or an online FAQ.

For a full example, see codequestion, which is an application that matches user questions to Stack Overflow question-answer pairs.
