David Mezzetti for NeuML

Posted on Oct 13, 2022 • Edited on Apr 25 • Originally published at neuml.hashnode.dev

Classic Topic Modeling with BM25

#ai #llm #rag #vectordatabase

txtai 5.0 introduced topic modeling via semantic graphs. Semantic graphs can be easily integrated into an embeddings instance to add topic modeling to a txtai index.

In addition to transformers-backed models, txtai also has support for traditional indexing methods. Given the modular design of txtai, traditional scoring methods like BM25 can be combined with graphs to build topic models.

This article is all classic Python code on the CPU. No GPUs or machine learning models required!

Install dependencies

Install txtai and all dependencies.

# Install txtai
pip install txtai[graph] datasets

Load dataset

This example will use the ag_news dataset, which is a collection of news article headlines.

from datasets import load_dataset

dataset = load_dataset("ag_news", split="train")

Build BM25 Index

Since the original txtai release, there has been a scoring package. This package supports building standalone BM25, TF-IDF and/or SIF text indexes.

from txtai.scoring import ScoringFactory

# List of all text elements
texts = dataset["text"]

# Build index
scoring = ScoringFactory.create({"method": "bm25", "terms": True})
scoring.index((x, text, None) for x, text in enumerate(texts))

# Show total
scoring.count()

Let's test the index.

for id, score in scoring.search("planets explore life earth", 3):
  print(id, texts[id], score)

16327 3 Planets Are Found Close in Size to Earth, Making Scientists Think 'Life' A trio of newly discovered worlds are much smaller than any other planets previously discovered outside of the solar system. 20.72295380862701
16158 Earth #39;s  #39;big brothers #39; floating around stars Washington - A new class of planets has been found orbiting stars besides our sun, in a possible giant leap forward in the search for Earth-like planets that might harbour life. 19.917461045326878
16620 New Planets could advance search for Life Astronomers in Europe and the United States have found two new planets about 20 times the size of Earth beyond the solar system. The discovery might be a giant leap forward in  19.917461045326878

Results look as expected. BM25 returns keyword-based results vs contextual matches.

Build topic model

Now that we have a scoring index, we'll use it to build a graph.

Graphs have built-in methods to insert nodes and build a relationship index between the nodes. The index method takes a search parameter that can be any function that returns (id, score) pairs. This logic is built into embeddings instances.

Graphs constructed via a BM25 index will have more literal relationships. In other words, it will be keyword-driven. Semantic graphs backed by embeddings will have contextual relationships.

The next section builds a graph to support topic modeling. We'll use a multiprocessing pool to maximize CPU usage.

import os

from multiprocessing import Pool

from txtai.graph import GraphFactory

# Multiprocessing helper methods
SCORING = None

def create(search):
    global SCORING

    # Create a global scoring object
    SCORING = search

def run(params):
    query, limit = params
    return SCORING.search(query, limit)

def batchsearch(queries, limit):
    return pool.imap(run, [(query, limit) for query in queries])

# Build the graph
pool = None
with Pool(os.cpu_count(), initializer=create, initargs=(scoring,)) as pool:
    graph = GraphFactory.create({"topics": {}})
    graph.insert((x, text, None) for x, text in enumerate(texts))
    graph.index(batchsearch, None)

Let's list the top 10 topics. Keep in mind this dataset is from 2004.

list(graph.topics)[:10]

['kerry_bush_john_president',
 'nhl_players_league_lockout',
 'arafat_yasser_palestinian_leader',
 'sharon_ariel_prime_minister',
 'blair_tony_minister_prime',
 'xp_windows_microsoft_sp2',
 'athens_gold_medal_olympic',
 'space_prize_million_spaceshipone',
 'nikkei_tokyo_reuters_average',
 'hostage_british_bigley_iraq']

Topics map a list of ids for each matching text element ordered by topic relevance. Let's print the most relevant text element for a topic.

uid = graph.topics["xp_windows_microsoft_sp2"][0]
graph.attribute(uid, "text")

Microsoft continues Windows XP SP2 distribution Continuing the roll-out of Windows XP Service Pack 2 (SP2), Microsoft Corp. on Wednesday began pushing the security-focused update to PCs running Windows XP Professional Edition

Graph analysis

Given this is a standard txtai graph, analysis methods such as centrality and pagerank are available.

centrality = list(graph.centrality().keys())
print("Top connection count:", [len(graph.edges(uid)) for uid in centrality[:5]], "\n")

# Print most central node/topic
print("Most central node:", graph.attribute(centrality[0], "text"))

topic = graph.attribute(centrality[0], "topic")
for uid in graph.topics[topic][:3]:
  print("->", graph.attribute(uid, "text"))

Top connection count: [30, 30, 28, 28, 28] 

Most central node: Manning Gets Chance to Start Giants Coach Tom Coughlin announced that rookie quarterback Eli Manning will start ahead of two-time M.V.P. Kurt Warner in Thursday's preseason game against Carolina.
-> Manning Replaces Warner As Giants QB (AP) AP - Eli Manning has replaced Kurt Warner as the New York Giants' starting quarterback.
-> Eli Manning replaces Warner at quarterback Eli Manning, the top pick in this year #39;s NFL draft, has been named the starting quarterback of the New York Giants. Coach Tom Coughlin made the announcement at a Monday news conference.
-> Giants to Start Manning Against Carolina (AP) AP - Eli Manning is going to get a chance to open the season as the New York Giants' starting quarterback.

Notice the correlation between the number of connections and centrality.

Given that BM25 is keyword-driven, we expect that the most central node would be text that is duplicative in nature. And that is the case here.

Walk the graph

Just like semantic graphs, relationship paths can be explored.

from IPython.display import HTML

def showpath(source, target):
  path = graph.showpath(source, target)
  path = [graph.attribute(p, "text") for p in path]

  sections = []
  for x, p in enumerate(path):
    # Print start node
    sections.append(f"{x + 1}. {p}")

  return HTML("<br/><br/>".join(sections))

showpath(83978, 8107)

1. NFL Game Summary - NY Jets at Buffalo Orchard Park, NY (Sports Network) - Willis McGahee ran for 132 yards and a touchdown to lead the Buffalo Bills to a 22-17 victory over the New York Jets at Ralph Wilson Stadium.

2. NCAA Game Summary - Marshall at Georgia Athens, GA (Sports Network) - Michael Cooper ran for the only touchdown of the game, as third-ranked Georgia rode its defense to a 13-3 victory over Marshall at Sanford Stadium.

3. NCAA Game Summary - Northwestern at Wisconsin Madison, WI (Sports Network) - Anthony Davis ran for 122 yards and two touchdowns to lead No. 6 Wisconsin over Northwestern, 24-12, to celebrate Homecoming weekend at Camp Randall Stadium.

4. NCAA Top 25 Game Summary - Northwestern at Minnesota The last time Minnesota won four games to start three consecutive seasons was 1934-36...Chris Malleo replaced Basanez for two series in the third quarter for his first career appearance.

5. UConn ousts Marist Sophomore Steve Sealy netted his third winning goal in the last four games, giving Connecticut a 2-1 overtime victory over Marist yesterday in an NCAA Division 1 first-round men's soccer playoff game at Morrone Stadium in Storrs, Conn.

6. United States upsets Germany to move to soccer semifinals Deep into overtime, and maybe the last time for the Fab Five of US women #39;s soccer, the breaks were going against them. A last-gasp goal that stole victory in regulation, a wide-open shot that bounced off the goal post.

Notice how the data pivots from the start node to the end node. If you've read the Introducing the Semantic Graph article, you'll notice how this traversal is more literal in nature. In other words, the relationships are keyword-driven vs contextual.

Wrapping up

This article demonstrated how graphs can index traditional indexes such as BM25. This method can also be applied to an external index provided a search function is available to build connections.

Semantic graphs backed by embeddings instances have a number of advantages and are recommended in most cases. But this is a classic way to do it - no machine learning models required!

DEV Community

Classic Topic Modeling with BM25

Install dependencies

Load dataset

Build BM25 Index

Build topic model

Graph analysis

Walk the graph

Wrapping up

Top comments (0)

Read next

Introduction to Hadoop:)

VoiceScribe: Elevating Transcriptions with AssemblyAI's Universal-2 Model

Negative Eigenvalues Boost Neural Networks' Memory and Pattern Recognition Abilities

Day 40: Constrained Decoding with LLMs