David Mezzetti for NeuML

Originally published at neuml.hashnode.dev

Entity extraction workflows

This article is part of a tutorial series on txtai, an AI-powered semantic search platform.

Entity extraction is the process of identifying names, locations, organizations and other entity-like tokens in unstructured text. Extracted entities can be used to organize data into topics and/or feed downstream machine learning pipelines.

This notebook will show how to use the entity extraction pipeline in txtai with workflows.

Install dependencies

Install txtai and all dependencies.

pip install txtai

Extract entities

Let's get right to it! The following example creates an entity pipeline and extracts entities from text.

from txtai.pipeline import Entity

data = ["US tops 5 million confirmed virus cases",
        "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
        "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
        "The National Park Service warns against sacrificing slower friends in a bear attack",
        "Maine man wins $1M from $25 lottery ticket",
        "Make huge profits without work, earn up to $100,000 a day"]

entity = Entity()

for x, e in enumerate(entity(data)):
  print(data[x])
  print(f"  {e}", "\n")
US tops 5 million confirmed virus cases
  [('US', 'LOC', 0.999273955821991)] 

Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg
  [('Canada', 'LOC', 0.999609649181366), ('Manhattan', 'MISC', 0.651396632194519)] 

Beijing mobilises invasion craft along coast as Taiwan tensions escalate
  [('Beijing', 'LOC', 0.9996659755706787), ('Taiwan', 'LOC', 0.9996755123138428)] 

The National Park Service warns against sacrificing slower friends in a bear attack
  [('National Park Service', 'ORG', 0.9993489384651184)] 

Maine man wins $1M from $25 lottery ticket
  [('Maine', 'LOC', 0.9987521171569824)] 

Make huge profits without work, earn up to $100,000 a day
  [] 

The section above runs the entity extraction pipeline on each row in data. For each row, the output is a list of (entity, type, score) tuples: the token(s) identified as an entity, the entity type and the model's confidence in the prediction.
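Since each result is a list of (entity, type, score) tuples, downstream code can filter or group entities directly. The sketch below uses scores copied from the output above and a hypothetical group_by_type helper (not part of txtai) to keep only high-confidence predictions, bucketed by entity type:

```python
# Group entities by type, dropping low-confidence predictions.
# "results" mirrors the pipeline output shown above.
results = [
    [("US", "LOC", 0.9993)],
    [("Canada", "LOC", 0.9996), ("Manhattan", "MISC", 0.6514)],
    [("Beijing", "LOC", 0.9997), ("Taiwan", "LOC", 0.9997)],
]

def group_by_type(rows, threshold=0.9):
    grouped = {}
    for row in rows:
        for text, label, score in row:
            if score >= threshold:
                grouped.setdefault(label, []).append(text)
    return grouped

print(group_by_type(results))
# {'LOC': ['US', 'Canada', 'Beijing', 'Taiwan']}
```

Note how the low-confidence Manhattan/MISC prediction is filtered out, matching the intuition from the scores printed above.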

Feed entities to a workflow

The next section demonstrates how the entity extraction pipeline can be used as part of a workflow. This workflow extracts the entities from each row and builds an embeddings index over them. This effectively computes entity embeddings, comparing row similarity with a focus on the entities mentioned.

from txtai.embeddings import Embeddings, Documents
from txtai.workflow import Workflow, Task

# Create workflow with an entity pipeline output into a documents collection
documents = Documents()
workflow = Workflow([Task(lambda x: entity(x, flatten=True, join=True)), Task(documents.add, unpack=False)])

# Run workflow
for _ in workflow([(x, row, None) for x, row in enumerate(data)]):
  pass

embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2"})
embeddings.index(documents)

for query in ["North America", "Asia Pacific"]:
  index = embeddings.search(query, 1)[0][0]
  print(query, "\t", data[index])
North America    Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg
Asia Pacific     Beijing mobilises invasion craft along coast as Taiwan tensions escalate
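To see why these queries match, it helps to look at what the first task feeds the indexer: with flatten=True and join=True, each row is reduced to a string of its entities (assuming join concatenates the flattened entities with spaces). A pure-Python sketch of that transformation, using the entities printed in the first section:

```python
# Simulate the entity task's flatten=True, join=True output using
# the entities printed in the first section
entities = [
    ["US"],
    ["Canada", "Manhattan"],
    ["Beijing", "Taiwan"],
    ["National Park Service"],
    ["Maine"],
    [],
]

# Each (id, text, tag) tuple becomes (id, "entity1 entity2 ...", tag),
# so the embeddings index only sees the mentioned entities
transformed = [(uid, " ".join(names), None) for uid, names in enumerate(entities)]
print(transformed[1])  # (1, 'Canada Manhattan', None)
```

So the "North America" query is compared against "Canada Manhattan" rather than the full headline, which is why entity-heavy similarity works here.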

Run workflow YAML

Below is the same example as workflow YAML. The entity task's args list passes positional arguments to the pipeline, so [null, "simple", true, true] corresponds to labels=None, aggregate="simple", flatten=True, join=True, matching the flatten/join settings used above.

workflow = """
writable: true
embeddings:
  path: sentence-transformers/nli-mpnet-base-v2

entity:

workflow:
  index:
    tasks:
      - action: entity
        args: [null, "simple", true, true]
      - action: index
"""

from txtai.app import Application

# Create and run workflow
app = Application(workflow)
for _ in app.workflow("index", [(x, row, None) for x, row in enumerate(data)]):
  pass

# Run queries
for query in ["North America", "Asia Pacific"]:
  index = app.search(query)[0]["id"]
  print(query, "\t", data[index])
North America    Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg
Asia Pacific     Beijing mobilises invasion craft along coast as Taiwan tensions escalate

Wrapping up

This notebook introduced entity extraction pipelines with txtai. This pipeline supports a number of different configurations to help feed downstream systems and/or directly use the entities.

As with other pipelines, the entity extraction pipeline can be used standalone in Python, as an API service or as part of a workflow!
