David Mezzetti for NeuML

Posted on Jan 28, 2021 • Edited on Apr 25 • Originally published at neuml.hashnode.dev

Extractive QA with Elasticsearch

#ai #llm #rag #vectordatabase

txtai is datastore agnostic: the library analyzes sets of text, wherever they are stored. The following example shows how to add extractive question-answering on top of an Elasticsearch system.

Install dependencies

Install txtai and Elasticsearch.

# Install txtai and the Elasticsearch Python client (pinned to match the 7.10.1 server downloaded below)
pip install txtai elasticsearch==7.10.1

# Download and extract elasticsearch
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.1-linux-x86_64.tar.gz
tar -xzf elasticsearch-7.10.1-linux-x86_64.tar.gz
chown -R daemon:daemon elasticsearch-7.10.1

Start an instance of Elasticsearch.

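import os
import time
from subprocess import Popen, PIPE, STDOUT

# Start the server as the daemon user (uid 1), since Elasticsearch refuses to run as root
server = Popen(['elasticsearch-7.10.1/bin/elasticsearch'], stdout=PIPE, stderr=STDOUT, preexec_fn=lambda: os.setuid(1))

# Give the server time to start up
time.sleep(30)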
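
A fixed 30-second wait can be too short on a slow machine and wasteful on a fast one. As a minimal sketch, assuming the server listens on the default http://localhost:9200, the wait could instead poll the HTTP port until it responds. The helper name `wait_for_http` and its parameters are hypothetical and not part of txtai or the Elasticsearch client.

```python
import time
import urllib.error
import urllib.request


def wait_for_http(url, timeout=60.0, interval=1.0):
    """Poll url until it answers with any HTTP response or timeout seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with urllib.request.urlopen(url, timeout=interval):
                return True
        except urllib.error.HTTPError:
            # The server answered, even if with an error status
            return True
        except OSError:
            # Connection refused or timed out: nothing is listening yet
            # (urllib.error.URLError is a subclass of OSError)
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Calling `wait_for_http("http://localhost:9200")` after launching the process would return True as soon as the port responds, or False once the timeout expires. A responding port only shows that the HTTP listener is up; it does not guarantee the cluster is ready to index.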

Download data

This example works with a subset of the CORD-19 dataset. The COVID-19 Open Research Dataset (CORD-19) is a free resource of scholarly articles, aggregated by a coalition of leading research groups, covering COVID-19 and the coronavirus family of viruses.

The following download is a SQLite database generated from a Kaggle notebook. More information on this data format can be found in the CORD-19 Analysis notebook.

wget https://github.com/neuml/txtai/releases/download/v1.1.0/tests.gz
gunzip tests.gz
mv tests articles.sqlite

Load data into Elasticsearch

The following block copies rows from SQLite to Elasticsearch.

import sqlite3

import regex as re

from elasticsearch import Elasticsearch, helpers

# Connect to ES instance
es = Elasticsearch(hosts=["http://localhost:9200"], timeout=60, retry_on_timeout=True)

# Connection to database file
db = sqlite3.connect("articles.sqlite")
cur = db.cursor()

# Elasticsearch bulk buffer
buffer = []
rows = 0

# Select tagged sentences without an NLP label. NLP labels are set for non-informative sentences.
cur.execute("SELECT s.Id, Article, Title, Published, Reference, Name, Text FROM sections s JOIN articles a on s.article=a.id WHERE (s.labels is null or s.labels NOT IN ('FRAGMENT', 'QUESTION')) AND s.tags is not null")
for row in cur:
  # Build dict of name-value pairs for fields
  article = dict(zip(("id", "article", "title", "published", "reference", "name", "text"), row))
  name = article["name"]

  # Skip background, introduction, reference and standalone discussion sections
  if not name or not re.search(r"background|(?<!.*?results.*?)discussion|introduction|reference", name.lower()):
    # Bulk action fields
    article["_id"] = article["id"]
    article["_index"] = "articles"

    # Buffer article
    buffer.append(article)

    # Increment number of articles processed
    rows += 1

    # Bulk load every 1000 records
    if rows % 1000 == 0:
      helpers.bulk(es, buffer)
      buffer = []

      print("Inserted {} articles".format(rows), end="\r")

if buffer:
  helpers.bulk(es, buffer)

print("Total articles inserted: {}".format(rows))

Total articles inserted: 21499
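
The section filter in the loader relies on a variable-width lookbehind, which is why the code imports the third-party `regex` module instead of the standard library `re` (which rejects such patterns). The sketch below restates the same rule with plain string operations to make the behavior easier to read. The function name `keep_section` is hypothetical, and the rewrite assumes section names are single-line strings.

```python
def keep_section(name):
    """Return True when a section should be indexed."""
    # Sections without a name are always kept
    if not name:
        return True

    lowered = name.lower()

    # Background, introduction and reference sections are skipped
    if any(term in lowered for term in ("background", "introduction", "reference")):
        return False

    # A discussion section is skipped unless "results" appears before "discussion"
    position = lowered.find("discussion")
    if position != -1 and "results" not in lowered[:position]:
        return False

    return True


for name in (None, "Methods", "Introduction", "Discussion", "Results and Discussion"):
    print(name, keep_section(name))
```

With this rule a combined "Results and Discussion" section is still indexed, while a standalone "Discussion" section is not.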

Query data

The following runs a query against Elasticsearch for the terms "risk factors". It finds the top 5 matches and returns the corresponding documents associated with each match.

import pandas as pd

from IPython.display import display, HTML

pd.set_option("display.max_colwidth", None)

query = {
    "_source": ["article", "title", "published", "reference", "text"],
    "size": 5,
    "query": {
        "query_string": {"query": "risk factors"}
    }
}

results = []
for result in es.search(index="articles", body=query)["hits"]["hits"]:
  source = result["_source"]
  results.append((source["title"], source["published"], source["reference"], source["text"]))

df = pd.DataFrame(results, columns=["Title", "Published", "Reference", "Match"])

display(HTML(df.to_html(index=False)))

Title	Published	Reference	Match
Management of osteoarthritis during COVID‐19 pandemic	2020-05-21 00:00:00	https://doi.org/10.1002/cpt.1910	Indeed, risk factors are sex, obesity, genetic factors and mechanical factors (3) .
Prevalence and Impact of Myocardial Injury in Patients Hospitalized with COVID-19 Infection	2020-04-24 00:00:00	http://medrxiv.org/cgi/content/short/2020.04.20.20072702v1?rss=1	This risk was consistent across patients stratified by history of CVD, risk factors but no CVD, and neither CVD nor risk factors.
Does apolipoprotein E genotype predict COVID-19 severity?	2020-04-27 00:00:00	https://doi.org/10.1093/qjmed/hcaa142	Risk factors associated with subsequent death include older age, hypertension, diabetes, ischemic heart disease, obesity and chronic lung disease; however, sometimes there are no obvious risk factors .
COVID-19 and associations with frailty and multimorbidity: a prospective analysis of UK Biobank participants	2020-07-23 00:00:00	https://www.ncbi.nlm.nih.gov/pubmed/32705587/	BACKGROUND: Frailty and multimorbidity have been suggested as risk factors for severe COVID-19 disease.
COVID-19: what has been learned and to be learned about the novel coronavirus disease	2020-03-15 00:00:00	https://doi.org/10.7150/ijbs.45134	• Three major risk factors for COVID-19 were sex (male), age (≥60), and severe pneumonia.
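
A `query_string` query combines terms with OR by default, so "risk factors" can match sections that contain only one of the two words, with relevance scoring pushing the best matches to the top. As a hedged sketch, the helpers below build the query body used above alongside a stricter `match_phrase` variant that requires the words to appear together. The function names and the choice of the `text` field for phrase matching are assumptions made for illustration.

```python
# Fields returned with each hit, matching the query above
SOURCE_FIELDS = ["article", "title", "published", "reference", "text"]


def query_string_search(terms, size=5):
    """Build a query_string search body; terms are OR'ed by default."""
    return {
        "_source": SOURCE_FIELDS,
        "size": size,
        "query": {"query_string": {"query": terms}},
    }


def phrase_search(phrase, field="text", size=5):
    """Build a match_phrase search body that requires the exact phrase in field."""
    return {
        "_source": SOURCE_FIELDS,
        "size": size,
        "query": {"match_phrase": {field: phrase}},
    }
```

Either body can be passed to the existing call, for example `es.search(index="articles", body=phrase_search("risk factors"))`.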

Derive columns with Extractive QA

The next section uses Extractive QA to derive additional columns. For each article returned by the query, the full text is retrieved from Elasticsearch and a series of questions is asked of the document. Each answer is added as a derived column for that article.

from txtai.embeddings import Embeddings
from txtai.pipeline import Extractor

# Create embeddings model, backed by sentence-transformers & transformers
embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2"})

# Create extractor instance using qa model designed for the CORD-19 dataset
extractor = Extractor(embeddings, "NeuML/bert-small-cord19qa")

document = {
    "_source": ["id", "name", "text"],
    "size": 1000,
    "query": {
        "term": {"article": None}
    },
    "sort" : ["id"]
}

def sections(article):
  rows = []

  # Build a fresh query per call so the shared template is never mutated
  search = {**document, "query": {"term": {"article": article}}}

  for result in es.search(index="articles", body=search)["hits"]["hits"]:
    source = result["_source"]
    name, text = source["name"], source["text"]

    if not name or not re.search(r"background|(?<!.*?results.*?)discussion|introduction|reference", name.lower()):
      rows.append(text)

  return rows

results = []
for result in es.search(index="articles", body=query)["hits"]["hits"]:
  source = result["_source"]

  # Use QA extractor to derive additional columns
  answers = extractor([("Risk factors", "risk factor", "What are names of risk factors?", False),
                       ("Locations", "city country state", "What are names of locations?", False)], sections(source["article"]))

  results.append((source["title"], source["published"], source["reference"], source["text"]) + tuple([answer[1] for answer in answers]))

df = pd.DataFrame(results, columns=["Title", "Published", "Reference", "Match", "Risk Factors", "Locations"])

display(HTML(df.to_html(index=False)))

Title	Published	Reference	Match	Risk Factors	Locations
Management of osteoarthritis during COVID‐19 pandemic	2020-05-21 00:00:00	https://doi.org/10.1002/cpt.1910	Indeed, risk factors are sex, obesity, genetic factors and mechanical factors (3) .	Comorbidities	extrapulmonary sites
Prevalence and Impact of Myocardial Injury in Patients Hospitalized with COVID-19 Infection	2020-04-24 00:00:00	http://medrxiv.org/cgi/content/short/2020.04.20.20072702v1?rss=1	This risk was consistent across patients stratified by history of CVD, risk factors but no CVD, and neither CVD nor risk factors.	CVD, risk factors but no CVD, and neither CVD	None
Does apolipoprotein E genotype predict COVID-19 severity?	2020-04-27 00:00:00	https://doi.org/10.1093/qjmed/hcaa142	Risk factors associated with subsequent death include older age, hypertension, diabetes, ischemic heart disease, obesity and chronic lung disease; however, sometimes there are no obvious risk factors .	socioeconomic inequalities and risk factors	None
COVID-19 and associations with frailty and multimorbidity: a prospective analysis of UK Biobank participants	2020-07-23 00:00:00	https://www.ncbi.nlm.nih.gov/pubmed/32705587/	BACKGROUND: Frailty and multimorbidity have been suggested as risk factors for severe COVID-19 disease.	Frailty and multimorbidity	comorbidity groupings
COVID-19: what has been learned and to be learned about the novel coronavirus disease	2020-03-15 00:00:00	https://doi.org/10.7150/ijbs.45134	• Three major risk factors for COVID-19 were sex (male), age (≥60), and severe pneumonia.	age and underlying disease are strongly correlated	cities, provinces, and countries
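
Each tuple passed to the extractor above carries a column name, a search query ("risk factor"), a question and a snippet flag. The query narrows the article's sections down to the most relevant text before the question is asked of it. The sketch below illustrates only that narrowing step, and it swaps in simple token overlap where txtai uses embeddings similarity, so it is a toy stand-in rather than the library's implementation. The helper names are hypothetical, and the middle sample sentence is made up for illustration.

```python
import re


def tokens(text):
    """Lowercase a string and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_sections(query, sections, limit=3):
    """Return up to limit sections sharing the most tokens with query."""
    query_tokens = tokens(query)

    # Score each section by token overlap, remembering its original position
    scored = [(len(query_tokens & tokens(text)), index) for index, text in enumerate(sections)]

    # Highest score first, ties broken by original order
    ranked = sorted(scored, key=lambda item: (-item[0], item[1]))

    # Sections with no overlap are dropped
    return [sections[index] for score, index in ranked[:limit] if score > 0]


sections = [
    "Indeed, risk factors are sex, obesity, genetic factors and mechanical factors.",
    "The study protocol was approved by the ethics committee.",  # made-up filler sentence
    "Frailty and multimorbidity have been suggested as risk factors for severe COVID-19 disease.",
]

# Context that a question-answering model would then read
context = " ".join(top_sections("risk factor", sections, limit=2))
print(context)
```

Token overlap misses paraphrases that an embeddings model would catch, which is one reason the example above builds the extractor on a sentence-transformers model.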
