DEV Community

NeuML

Build an Embeddings index with Hugging Face Datasets

David Mezzetti
Founder/CEO at NeuML — applying machine learning to solve everyday problems. Previously co-founded and built Data Works into a successful IT services company.

This article is part of a tutorial series on txtai, an AI-powered search engine.

This article shows how txtai can index and search data loaded with Hugging Face's Datasets library. Datasets provides access to a large and growing list of publicly available datasets, along with functions to select, transform and filter the data stored in each one.

In this example, txtai will be used to index and query a dataset.

Install dependencies

Install txtai and all dependencies. Also install datasets and a specific sentence-transformers model that does well with general information retrieval tasks.

pip install txtai
pip install datasets

# Download sentence-transformer models not on Hugging Face model hub
wget https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/v0.2/msmarco-distilroberta-base-v2.zip
unzip -o msmarco-distilroberta-base-v2.zip
mv 0_Transformer/ msmarco-distilroberta-base-v2

Load dataset and build a txtai index

In this example, we'll load the ag_news dataset, a collection of news article headlines and short descriptions. This only takes a single line of code!

Next, txtai will index the first 10,000 rows of the dataset. A model trained on MS MARCO is used to compute sentence embeddings. sentence-transformers has a number of other pre-trained models that can be swapped in.
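
The row-limiting step can be sketched on its own, without downloading anything. This is a minimal sketch, assuming the input is any iterable of dict-like rows; the `rows` list is a made-up stand-in, not real ag_news data. It yields the same (id, text, None) tuples that the code in this article passes to the index, but uses `itertools.islice` to stop after `limit` rows.

```python
from itertools import islice

def stream(rows, field, limit):
    """Yield (id, text, tags) tuples for the first `limit` rows."""
    # enumerate supplies sequential ids; islice stops after `limit` rows
    for uid, row in enumerate(islice(rows, limit)):
        yield (uid, row[field], None)

# Hypothetical stand-in rows (not real ag_news data)
rows = [
    {"text": "first headline", "label": 0},
    {"text": "second headline", "label": 1},
    {"text": "third headline", "label": 2},
]

batch = list(stream(rows, "text", 2))
print(batch)
```

One design note: with `islice`, a limit of 0 yields nothing, and a limit larger than the input simply yields every row.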

In addition to the embeddings index, we'll also create a Similarity instance to re-rank search hits for relevancy.

from datasets import load_dataset

from txtai.embeddings import Embeddings
from txtai.pipeline import Similarity

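def stream(dataset, field, limit):
  # Yield (id, text, tags) tuples for the first `limit` rows; tags are not used here
  index = 0
  for row in dataset:
    yield (index, row[field], None)
    index += 1

    if index >= limit:
      break

def search(query):
  # Top 50 hits as (score, text); uid is the row position assigned in stream()
  return [(score, dataset[uid]["text"]) for uid, score in embeddings.search(query, limit=50)]

def ranksearch(query):
  # Re-rank the search hits with the Similarity pipeline
  results = [text for _, text in search(query)]
  # x is an index into results, not a dataset row id
  return [(score, results[x]) for x, score in similarity(query, results)]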

# Load HF dataset
dataset = load_dataset("ag_news", split="train")

# Create embeddings model, backed by sentence-transformers & transformers
embeddings = Embeddings({"method": "transformers", "path": "msmarco-distilroberta-base-v2"})
embeddings.index(stream(dataset, "text", 10000))

# Create similarity instance for re-ranking
similarity = Similarity("valhalla/distilbart-mnli-12-3")
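
The retrieve-then-re-rank flow in ranksearch can be illustrated without any models. The sketch below is a toy under stated assumptions: `retrieve` and `rerank` are hypothetical stand-ins that score by token overlap, not txtai's embeddings search or its Similarity pipeline. The point it shows is the index bookkeeping: the second stage returns positions into the candidate list, which have to be mapped back to text.

```python
def retrieve(query, documents, limit):
    """First stage (stand-in): score by shared tokens, return (uid, score)."""
    terms = set(query.lower().split())
    scored = [(uid, len(terms & set(text.lower().split())))
              for uid, text in enumerate(documents)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]

def rerank(query, texts):
    """Second stage (stand-in): Jaccard overlap, return (position, score)."""
    terms = set(query.lower().split())
    scored = []
    for position, text in enumerate(texts):
        tokens = set(text.lower().split())
        scored.append((position, len(terms & tokens) / len(terms | tokens)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

def ranksearch(query, documents, limit=50):
    # Stage 1: collect candidate texts by document id
    candidates = [documents[uid] for uid, _ in retrieve(query, documents, limit)]
    # Stage 2: positions index into candidates, not into documents
    return [(score, candidates[pos]) for pos, score in rerank(query, candidates)]

# Made-up documents for illustration
documents = [
    "Dodgers beat Braves",
    "Apple tops consumer satisfaction",
    "New planet found",
    "Apple recalls faulty batteries",
]

results = ranksearch("apple batteries", documents, limit=2)
for score, text in results:
    print("%.2f %s" % (score, text))
```

Keeping the first stage cheap and the candidate list small is what makes a slower, more accurate second stage affordable.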

Search the dataset

Now that an index is ready, let's search the data! The following section runs a series of queries and shows the results. Like basic search engines, txtai finds token matches. But the real power of txtai is finding semantically similar results.
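
To make the difference between token matching and semantic matching concrete, here is a toy sketch. It assumes vectors are compared with cosine similarity, a common choice for sentence embeddings; the three-dimensional vectors are hand-made for illustration and do not come from any model, and the index txtai builds may differ in detail.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hand-made vectors (illustration only); real models emit hundreds of dimensions
vectors = {
    "Apple recalls faulty batteries": [0.9, 0.1, 0.0],
    "Dodgers win on grand slam": [0.0, 0.2, 0.9],
}

query = "laptop fire hazard"
query_vector = [0.8, 0.3, 0.1]  # pretend embedding of the query

# Token matching finds nothing: the query shares no words with either text
shared = [set(query.split()) & set(text.lower().split()) for text in vectors]

# Vector similarity still ranks the battery recall first
ranked = sorted(vectors, key=lambda text: cosine(query_vector, vectors[text]), reverse=True)
print(ranked[0])
```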

sentence-transformers has a great overview on information retrieval that is well worth a read.

from IPython.display import display, HTML

def table(query, rows):
    html = """
    <style type='text/css'>
    @import url('https://fonts.googleapis.com/css?family=Oswald&display=swap');
    table {
      border-collapse: collapse;
      width: 900px;
    }
    th, td {
        border: 1px solid #9e9e9e;
        padding: 10px;
        font: 15px Oswald;
    }
    </style>
    """

    html += "<h3>%s</h3><table><thead><tr><th>Score</th><th>Text</th></tr></thead>" % (query)
    for score, text in rows:
        html += "<tr><td>%.4f</td><td>%s</td></tr>" % (score, text)
    html += "</table>"

    display(HTML(html))

for query in ["Positive Apple reports", "Negative Apple reports", "Best planets to explore for life", "LA Dodgers good news", "LA Dodgers bad news"]:
  table(query, ranksearch(query)[:2])
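
Dataset text can contain characters such as & or < that are significant in HTML. As a hedged variant of the table helper, the sketch below escapes the query and each text with the standard library's `html.escape` and returns the markup as a string instead of displaying it. The name `render_table` and the sample row are made up for illustration.

```python
from html import escape

def render_table(query, rows):
    """Return an HTML table for (score, text) rows, escaping all text."""
    parts = [
        "<h3>%s</h3>" % escape(query),
        "<table><thead><tr><th>Score</th><th>Text</th></tr></thead><tbody>",
    ]
    for score, text in rows:
        parts.append("<tr><td>%.4f</td><td>%s</td></tr>" % (score, escape(text)))
    parts.append("</tbody></table>")
    return "".join(parts)

# Made-up row containing characters that need escaping
markup = render_table("R&D news", [(0.9886, "Profit <up> 5% at A&B")])
print(markup)
```

In a notebook, the returned string can be shown with display(HTML(markup)).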

Positive Apple reports

Score Text
0.9886 Apple tops US consumer satisfaction Recent data published by the American Customer Satisfaction Index (ACSI) shows Apple leading the consumer computer industry with the the highest customer satisfaction.
0.9876 Apple Remote Desktop 2 Reviewing Apple Remote Desktop 2 in Computerworld, Yuval Kossovsky writes, #147;I liked what I found. #148; He concludes, #147;I am happy to say that ARD 2 is an excellent upgrade and well worth the money. #148; Aug 19

Negative Apple reports

Score Text
0.9847 Apple Recalls 28,000 Faulty Batteries Sold with 15-inch PowerBook Apple has had to recall up to 28,000 notebook batteries that were sold for use with their 15-inch PowerBook. Apple reports that faulty batteries sold between January 2004 and August 2004 can overheat and pose a fire hazard.
0.9733 Apple warns about bad batteries Apple is recalling 28,000 faulty batteries for its 15-inch Powerbook G4 laptops.

Best planets to explore for life

Score Text
0.9110 Tiny 'David' Telescope Finds 'Goliath' Planet A newfound planet detected by a small, 4-inch-diameter telescope demonstrates that we are at the cusp of a new age of planet discovery. Soon, new worlds may be located at an accelerating pace, bringing the detection of the first Earth-sized world one step closer.
0.8705 Life on Mars Likely, Scientist Claims (SPACE.com) SPACE.com - DENVER, COLORADO -- Those twin robots hard at work on Mars have transmitted teasing views that reinforce the prospect that microbial life may exist on the red planet.

LA Dodgers good news

Score Text
0.9990 Green's Slam Lifts L.A. Shawn Green connects on a grand slam and a solo homer to lead the Los Angeles Dodgers past the Atlanta Braves 7-4 on Saturday.
0.9961 Dodgers 7, Braves 4 Los Angeles, Ca. -- Shawn Green belted a grand slam and a solo homer as Los Angeles beat Mike Hampton and the Atlanta Braves 7-to-4 Saturday afternoon.

LA Dodgers bad news

Score Text
0.9880 Expos Keep Dodgers at Bay With 8-7 Win (AP) AP - Giovanni Carrara walked Juan Rivera with the bases loaded and two outs in the ninth inning Monday night, spoiling Los Angeles' six-run comeback and handing the Montreal Expos an 8-7 victory over the Dodgers.
0.9671 Gagne blows his 2d save Pinch-hitter Lenny Harris delivered a three-run double off Eric Gagne with two outs in the ninth, rallying the Florida Marlins past the Dodgers, 6-4, last night in Los Angeles.
