Named Entity Recognition (NER) Using ChatGPT

This is not just another ChatGPT post. Here we look at how Promptify can be used with LLMs (Large Language Models) such as ChatGPT to perform named entity recognition (NER), and why this approach is more robust than using ChatGPT directly.

Interesting takeaways

How Promptify is used for prompt engineering
Simple Named Entity Recognition (NER) using Promptify+ChatGPT
Custom Labels for Named Entity Recognition (NER) using Promptify+ChatGPT
One Shot Named Entity Recognition (NER) using Promptify+ChatGPT
Named Entity Recognition (NER) with domain knowledge using Promptify+ChatGPT

Let's begin ...

What is Prompt Engineering and Prompting

Prompt engineering is a natural language processing (NLP) concept that involves discovering or creating inputs that yield desirable or useful results. Prompting is the equivalent of telling the genie in the magic lamp what to do. In this case, the magic lamp is ChatGPT, ready to answer any of our questions, and Promptify is used to build and structure those questions so that LLMs like ChatGPT understand them better and return the desired results.

What is Named Entity Recognition (NER)

An entity can be defined as a piece of key information in the text. An entity can be a single word or a group of words. Named Entity Recognition (NER) is the process of identifying entities (key information) in text and classifying each one into a group.

Example

Here Person, Country and Designation are the groups/classes to which the entities belong. The process of identifying these entities and the class each one belongs to is called named entity recognition (NER).
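To make the idea concrete, here is a minimal sketch in Python. The sentence, the entity list, and the helper name `locate_entities` are all made up for illustration (they are not taken from the original example), and the labels are written by hand rather than predicted by a model.

```python
# Illustrative only: a hand-labelled sentence, not model output.
text = "Asha Rao works as Chief Engineer at a company based in India."

# Each entity (E) is paired with the type/class (T) it belongs to.
entities = [
    {"E": "Asha Rao", "T": "Person"},
    {"E": "Chief Engineer", "T": "Designation"},
    {"E": "India", "T": "Country"},
]

def locate_entities(text, entities):
    """Return (start, end, entity, type) for each entity found in the text."""
    spans = []
    for item in entities:
        start = text.find(item["E"])
        if start != -1:  # skip entities that do not occur verbatim
            spans.append((start, start + len(item["E"]), item["E"], item["T"]))
    return spans

spans = locate_entities(text, entities)
for start, end, entity, label in spans:
    print(f"{entity!r} [{start}:{end}] -> {label}")
```

Identifying is the "where is it" part (the character span), and classifying is the "what is it" part (the type).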

What exactly does Promptify do?

The input and output of LLMs like ChatGPT are generally plain unstructured text. When you pass your input through Promptify along with certain parameters (many of which are optional), Promptify sends the LLM a structured input, which is equivalent to asking a properly structured question that helps the LLM understand the task better. The output from the LLM is then returned as a Python object.
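As a rough illustration of the idea (not Promptify's actual template or internals), the sketch below fills a fixed template with the same kind of parameters and asks for a machine-readable answer. The template wording and the function name `build_ner_prompt` are assumptions made for this example, and it uses Python's standard `string.Template` rather than a Jinja template such as the `ner.jinja` file used later in this post.

```python
from string import Template

# Hypothetical template text; Promptify's real ner.jinja is worded differently.
NER_TEMPLATE = Template(
    "You are an expert in the $domain domain.\n"
    "Extract the named entities from the text below.\n"
    "${label_line}"
    "Answer only with a Python list of dicts like "
    "[{'E': entity, 'T': type}].\n"
    "Text: $text"
)

def build_ner_prompt(text, domain="general", labels=None):
    """Assemble a structured NER prompt from plain parameters."""
    label_line = ""
    if labels:
        # Only mention labels when the caller restricts them.
        label_line = "Use only these labels: " + ", ".join(labels) + ".\n"
    return NER_TEMPLATE.substitute(domain=domain, label_line=label_line, text=text)

prompt = build_ner_prompt(
    "The patient has hypertension.", domain="medical", labels=["SYMPTOM", "DISEASE"]
)
print(prompt)
```

The point is that the caller supplies a few parameters, and the same fixed instructions and output format are sent to the model every time.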

OUTPUT - Plain ChatGPT vs Promptify+ChatGPT

First we ask ChatGPT to perform named entity recognition on plain text, and we also tell it which domain the input sentence belongs to. Then we give the same input through Promptify and compare the responses.

Plain ChatGPT

There is a good chance that the output structure will change when you try again, which makes this output comparatively hard to use in an application.

Promptify+ChatGPT

The entity (E) and its corresponding class/type (T) are returned as Python objects from Promptify. You can also see that more entities are recognized when the input is passed through Promptify. This output is comparatively much more robust and can easily be used in an application.
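Because the result is an ordinary list of dictionaries, downstream code can work with it directly. The snippet below is a small sketch of that. The `entities` list is a hand-written stand-in for the kind of output described above, not an actual model response, and `group_by_type` is a helper name chosen for this example.

```python
from collections import defaultdict

# Hand-written stand-in for a structured NER result (not real model output).
entities = [
    {"E": "93-year-old", "T": "Age"},
    {"E": "hypertension", "T": "Medical Condition"},
    {"E": "depression", "T": "Medical Condition"},
]

def group_by_type(entities):
    """Collect entity strings under their type, preserving input order."""
    grouped = defaultdict(list)
    for item in entities:
        grouped[item["T"]].append(item["E"])
    return dict(grouped)

grouped = group_by_type(entities)
print(grouped)
```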

Now let's look at the Python implementation of Promptify. The code is run in Google Colab to make it easier to follow.

Promptify - ChatGPT

%%capture
!git clone https://github.com/promptslab/Promptify.git
!pip3 install openai

Clone the Promptify repository and install the openai library.

# Define the API key for the OpenAI model
api_key  = ""

Paste in your API key, which you can generate by following the blog post "How to generate API secret key".
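Hardcoding a secret in a notebook makes it easy to leak when the notebook is shared. One common alternative, sketched here, is to read the key from an environment variable. The variable name `OPENAI_API_KEY` is a widely used convention and `load_api_key` is a helper name made up for this example. Neither comes from the Promptify documentation.

```python
import os

def load_api_key(var_name="OPENAI_API_KEY"):
    """Read the API key from the environment, failing early if it is missing."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key

# Usage (uncomment once the variable is set in your environment):
# api_key = load_api_key()
```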

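# Import path is an assumption; older Promptify versions expose these under promptify.models and promptify.prompts
from promptify import OpenAI, Prompter
from pprint import pprint
# Create an instance of the OpenAI model. All OpenAI models are currently supported; models from Hugging Face and other platforms are planned
model = OpenAI(api_key)
nlp_prompter = Prompter(model)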

Create an instance of the OpenAI model and pass it to Promptify's Prompter. You now have an object to which you can pass your prompt along with the required parameters.

# Example sentence that is sent to GPT
sent = "The patient is a 93-year-old female with a medical history of chronic right hip pain, osteoporosis, hypertension, depression, and chronic atrial fibrillation admitted for evaluation and management of severe nausea and vomiting and urinary tract infection"

This sample input is related to the medical domain, and it is about a patient's medical condition.

NAMED ENTITY RECOGNITION (NER) WITH 2 LINES OF CODE

# Named Entity Recognition with No labels, no description, no oneshot, no examples
# Simple prompt with instructions
# domain name gives more info to model for better result generation, the parameter is optional
# Output will be python object -> [ {'E' : Entity Name, 'T': Type of Entity } ]


result = nlp_prompter.fit('ner.jinja',
                          domain      = 'medical',
                          text_input  = sent, 
                          labels      = None)

# Output
pprint(eval(result['text']))

In the output
E - Entity
T - Type/Class the entity belongs to
As you can see, the output is a Python object and is well structured compared to the raw output from ChatGPT. This feature of Promptify can be extremely useful when integrating LLMs with applications. The domain parameter is optional; passing a domain gives the model more context and tends to produce a more refined response.

That's not all Promptify can do...
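One caveat about the `eval` call above: it executes whatever text the model returns, which is risky when that text comes from an external service. A more defensive option is `ast.literal_eval`, which only accepts Python literals. The sketch below assumes the response text is a list of dicts with 'E' and 'T' keys, as described in the code comments above. `parse_entities` is a name chosen for this example, and the sample string is hand-written rather than real output.

```python
import ast

def parse_entities(raw_text):
    """Parse a model response into a list of {'E': ..., 'T': ...} dicts.

    Raises ValueError if the text is not a Python list literal.
    Items without both keys are skipped rather than treated as errors,
    since a response may carry extra metadata.
    """
    try:
        parsed = ast.literal_eval(raw_text.strip())
    except (ValueError, SyntaxError) as exc:
        raise ValueError(f"Response is not a Python literal: {exc}") from exc
    if not isinstance(parsed, list):
        raise ValueError("Expected a list of entities.")
    return [
        item for item in parsed
        if isinstance(item, dict) and {"E", "T"} <= item.keys()
    ]

sample = "[{'E': 'hypertension', 'T': 'Medical Condition'}]"
print(parse_entities(sample))
```

With this helper, the last line of the cell would become `pprint(parse_entities(result['text']))`.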

CUSTOM LABELS FOR NAMED ENTITY RECOGNITION (NER)

You can also provide custom labels so that only entities belonging to those labels are identified in the input text.

# To perform NER with custom tags only (handles out-of-bounds predictions)


result = nlp_prompter.fit('ner.jinja',
                          domain      = 'medical',
                          text_input  = sent, 
                          labels      = ["SYMPTOM", "DISEASE"])

# Output
pprint(eval(result['text']))

You can see from the output that the entities belonging to the provided custom labels were identified.
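A model can still return a label outside the requested set. If the application must only ever see the requested labels, a small post-processing step can enforce that. This is a sketch of such a filter, written for this post rather than taken from Promptify. `filter_by_labels` and the sample entities are made up for illustration.

```python
def filter_by_labels(entities, allowed_labels):
    """Keep only entities whose type is in allowed_labels (case-insensitive)."""
    allowed = {label.upper() for label in allowed_labels}
    return [item for item in entities if str(item.get("T", "")).upper() in allowed]

# Hand-written predictions, not real model output.
predicted = [
    {"E": "nausea", "T": "SYMPTOM"},
    {"E": "osteoporosis", "T": "Disease"},
    {"E": "93-year-old", "T": "AGE"},  # label outside the requested set
]
kept = filter_by_labels(predicted, ["SYMPTOM", "DISEASE"])
print(kept)
```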

ONE SHOT - NAMED ENTITY RECOGNITION (NER)

One-shot learning, as the name suggests, is the ability of a model to pick up a task from a single labelled example. Here the example is supplied inside the prompt; the model itself is not retrained. That's fascinating, right? With a powerful LLM like GPT and the help of Promptify, you can actually do it.

one_shot_training_data = "Leptomeningeal metastases (LM) occur in patients with breast cancer (BC) and lung cancer (LC). The cerebrospinal fluid (CSF) tumour microenvironment (TME) of LM patients is not well defined at a single-cell level. We did an analysis based on single-cell RNA sequencing (scRNA-seq) data and four patient-derived CSF samples of idiopathic intracranial hypertension (IIH)"

one_shot_labelled_training_data = [[one_shot_training_data, [{'E': 'DISEASE', 'W': 'Leptomeningeal metastases'}, {'E': 'DISEASE', 'W': 'breast cancer'}, {'E': 'DISEASE', 'W': 'lung cancer'}, {'E': 'BIOMARKER', 'W': 'cerebrospinal fluid'}, {'E': 'DISEASE', 'W': 'tumour microenvironment'}, {'E': 'TEST', 'W': 'single-cell RNA sequencing'}, {'E': 'DISEASE', 'W': 'idiopathic intracranial hypertension'}]]]

result = nlp_prompter.fit('ner.jinja',
                          domain      = 'medical',
                          text_input  = sent,
                          examples    = one_shot_labelled_training_data,
                          labels      = ["SYMPTOM", "DISEASE"])


pprint(eval(result['text']))

Here you have provided just one labelled example to the model, along with the labels SYMPTOM and DISEASE. In the labelled example:
E - Label/Class to which the entity belongs to
W - Entity

You can see from the output that entities belonging to the requested labels (SYMPTOM, DISEASE) were identified along with their corresponding labels, guided by the single labelled example.
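Writing the nested example structure by hand is error-prone. The helper below is a sketch that builds one example in the `[text, [{'E': label, 'W': entity}, ...]]` shape used above and checks that every entity actually appears in the text. The function name `make_example` is an assumption for this post and is not part of Promptify. The example text is a shortened version of the training sentence above.

```python
def make_example(text, annotations):
    """Build one [text, labels] example from (entity, label) pairs."""
    labelled = []
    for entity, label in annotations:
        # Catch typos early: the entity must occur in the example text.
        if entity.lower() not in text.lower():
            raise ValueError(f"{entity!r} does not occur in the example text")
        labelled.append({"E": label, "W": entity})
    return [text, labelled]

example_text = "Leptomeningeal metastases (LM) occur in patients with breast cancer (BC)."
examples = [
    make_example(
        example_text,
        [("Leptomeningeal metastases", "DISEASE"), ("breast cancer", "DISEASE")],
    )
]
print(examples)
```

The resulting `examples` list has the same shape as `one_shot_labelled_training_data`, so it could be passed as the `examples` argument in the same way.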

NAMED ENTITY RECOGNITION - WITH DOMAIN KNOWLEDGE

# To add domain knowledge and a description to the prompt to enhance the output

result = nlp_prompter.fit('ner.jinja',
                          domain      = 'clinical',
                          text_input  = sent,
                          examples    = one_shot_labelled_training_data,
                          description = "Below Paragraph is from discharge summary of a patient. The Paragraph describes the condition and symptoms of patient.",
                          labels      = ["SYMPTOM", "DISEASE"])

pprint(eval(result['text']))

If you have domain knowledge (here, the clinical domain) and a short description of what the data is about, you can pass that description in the description parameter, which can further improve the accuracy of the output.
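Since `domain`, `labels`, `examples`, and `description` are used in different combinations throughout this post, one way to keep the calls tidy is to collect only the options that are set. The helper name `ner_options` is made up for this sketch, and the keyword names are simply the ones used in the `fit` calls above. The final call is left commented out because it needs the `nlp_prompter` object and an API key.

```python
def ner_options(text, domain=None, labels=None, examples=None, description=None):
    """Return keyword arguments for an NER call, omitting unset options."""
    # text_input and labels are always passed, as in the calls above.
    options = {"text_input": text, "labels": labels}
    optional = {"domain": domain, "examples": examples, "description": description}
    options.update({k: v for k, v in optional.items() if v is not None})
    return options

options = ner_options(
    "The patient reports severe nausea.",
    domain="clinical",
    labels=["SYMPTOM", "DISEASE"],
    description="Paragraph from a patient's discharge summary.",
)
print(options)
# result = nlp_prompter.fit('ner.jinja', **options)
```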