Ankita Sahoo

NLP Application (Real-world implementation of Transformer model)

Natural language processing (NLP) is a field of artificial intelligence that deals with the interaction between computers and humans using natural language. The importance of NLP lies in its ability to transform how humans and computers interact, enabling more intuitive and human-like communication between them.
This has numerous practical applications in information retrieval, sentiment analysis, machine translation, and question-answering, among others. NLP has the potential
to revolutionize many industries, such as healthcare, education, and customer service, by enabling more effective and efficient communication and information management.
As such, NLP has become an important area of research and development, with significant investment being made in its advancement.

The Transformer:

The Transformer architecture was introduced in the landmark paper "Attention Is All You Need". The fundamental unit of this architecture, the transformer block, consists of two main components: a multi-head self-attention mechanism and a fully connected feedforward network. The multi-head self-attention mechanism allows the model to focus on different parts of the input sequence at each layer and to weigh the importance of each part when making a prediction. This is accomplished by computing attention scores between each element in the input sequence and all other elements; these scores then weight the contribution of each token to the final representation. Using multiple attention heads lets the model learn different attention patterns for different tasks and input sequences, making it a more versatile and effective architecture.

The feedforward network is essentially a multi-layer perceptron (MLP) that takes the self-attention output as input, applies linear transformations with activation functions, and produces the final representation. This final representation is then passed to the next transformer block or used for making predictions.
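
To make the block concrete, here is a minimal single-head sketch in plain NumPy. It is only illustrative: it omits multi-head splitting, residual connections, and layer normalization, and all weights and dimensions are random placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Project the input sequence into queries, keys, and values.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    # Attention scores between every pair of tokens, scaled by sqrt(d_k).
    scores = softmax(Q @ K.T / np.sqrt(d_k))
    # Each output token is a weighted sum of the value vectors.
    return scores @ V

def feedforward(X, W1, b1, W2, b2):
    # Position-wise MLP: linear -> ReLU -> linear.
    return np.maximum(0, X @ W1 + b1) @ W2 + b2

# Toy example: 4 tokens, model dimension 8, hidden dimension 16.
rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 8, 16, 4
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

attended = self_attention(X, Wq, Wk, Wv)
out = feedforward(attended, W1, b1, W2, b2)
print(out.shape)  # (4, 8): one d_model-sized vector per input token
```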

BERT:

Later on, one of the most significant developments built on the Transformer was the introduction of BERT (Bidirectional Encoder Representations from Transformers). BERT is a pre-trained transformer model that can be fine-tuned for a wide range of NLP tasks, such as sentiment analysis, named entity recognition, and question answering. BERT was trained with a masked language modeling objective: it was asked to predict masked (missing) tokens in a sentence given the context of the surrounding tokens. This approach allowed BERT to learn rich contextual representations of words, making it highly effective for a wide range of NLP tasks.
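
As a small illustration of this masked-token objective, the sketch below uses the Hugging Face Transformers fill-mask pipeline with a standard BERT checkpoint; the model name and output fields reflect the library's defaults and are illustrative assumptions.

```python
from transformers import pipeline

# Load a pre-trained BERT and ask it to fill in the masked token.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("The capital of France is [MASK]."):
    # Each prediction carries a candidate token and its score.
    print(prediction["token_str"], round(prediction["score"], 3))
```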

GPT:

Another development in the field of transformers was the introduction of GPT (Generative Pre-trained Transformer). GPT is a generative model trained on a large corpus of text with the goal of predicting the next token in a sequence given the context of the preceding tokens. GPT has been shown to be highly effective for tasks such as text generation, language modeling, and question answering. Unlike BERT, which was trained with a masked (cloze-style) objective, GPT was trained autoregressively on a purely generative task, allowing it to learn a broader and more complete model of language.
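
A minimal generation sketch, assuming the Hugging Face Transformers library and the openly available GPT-2 checkpoint as a stand-in for the GPT family; the prompt and length limit are illustrative:

```python
from transformers import pipeline, set_seed

# GPT-2 is the openly available member of the GPT family.
generator = pipeline("text-generation", model="gpt2")
set_seed(42)

result = generator("Transformers have changed NLP because",
                   max_new_tokens=30, num_return_sequences=1)
print(result[0]["generated_text"])
```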

Transformers have not only revolutionized the field of NLP; they are growing beyond it and finding applications in other areas. For example, transformers have been applied to computer vision tasks such as image captioning, where they generate captions for images based on their content, and to speech recognition, where they transcribe speech into text.

Another trend in the use of transformers is the development of multimodal models, which allow for the unified modeling and use of text along with other modalities, such as images and audio. These models can help to understand the relationships between different modalities and can use this understanding to perform a wide range of tasks, such as image-to-text generation, text-to-image generation, and audio-to-text generation. Indeed, transformers are growing beyond the field of NLP and are being used in a wide range of
tasks and applications.

Categorization:

The kind of transformer architecture used in NLP applications plays a crucial role in determining the overall performance of the system.

Encoder/decoder based:

  • Encoder-only transformers are used for discriminative tasks such as sentiment analysis and named entity recognition.
  • Decoder-only transformers are used for tasks such as text generation and summarization.
  • Encoder–decoder transformers are used for tasks such as machine translation and image captioning.
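
The distinction can be seen in how models are loaded. The sketch below assumes the Hugging Face Transformers library and common checkpoints (BERT, GPT-2, T5), pairing each architecture type with a typical task head; the specific checkpoints are illustrative choices.

```python
from transformers import (AutoModelForSequenceClassification,
                          AutoModelForCausalLM,
                          AutoModelForSeq2SeqLM)

# Encoder-only: discriminative tasks (e.g., sentiment analysis).
encoder_only = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# Decoder-only: free-form generation (e.g., text continuation).
decoder_only = AutoModelForCausalLM.from_pretrained("gpt2")

# Encoder-decoder: sequence-to-sequence tasks (e.g., translation, summarization).
encoder_decoder = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
```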

Modality based:

Modality refers to the different modes or types of data and information that can be processed and generated beyond text.

  • Unimodal NLP applications deal with a single modality, such as text or speech.
  • Multimodal NLP applications deal with multiple modalities, such as text, speech, and images. Text often serves as the primary interface in multimodal applications.

Real-world Applications:

  1. Unimodal Applications:

Unimodal applications refer to AI-based systems that primarily focus on processing and analyzing text as their main modality.

i: Language Modeling:

Language modeling is a fundamental task in NLP that involves predicting the next word in a sequence of text based on the preceding words.
Language modeling typically follows the decoder-only architecture.
The goal of language modeling is to estimate the probability distribution of sequences of words in a given language and is used as a building block for many NLP tasks such as machine translation, speech recognition, and text generation. Language modeling can be easily extended to more complex NLP tasks such as sentence-pair modeling, cross-document language modeling, and definition modeling.
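
Concretely, estimating the next-word distribution might look like the following sketch, which assumes the Hugging Face Transformers library with GPT-2 and simply inspects the softmax over the final position's logits; the prompt is illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Score the next token given the preceding words.
inputs = tokenizer("The weather today is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Probability distribution over the vocabulary for the next position.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(tokenizer.decode(int(token_id)), float(prob))
```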

ii: Question Answering:

Question Answering is an NLP application that involves automatically answering questions posed in natural language. The goal of question answering is to extract the relevant information from a given text corpus and present it as an answer to a user’s question.
Question-answering systems can operate over a wide range of text types, including news articles, Wikipedia pages, and others, and can be designed to answer a wide range of questions, including fact-based questions, opinion questions, and others.
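
A minimal extractive QA sketch, assuming the Hugging Face Transformers question-answering pipeline (which falls back to a SQuAD-fine-tuned default model); the context and question are illustrative:

```python
from transformers import pipeline

# Extractive QA: the answer is a span copied from the supplied context.
qa = pipeline("question-answering")  # defaults to a SQuAD-fine-tuned model

context = ("The Transformer architecture was introduced in 2017 and relies "
           "entirely on attention mechanisms, dispensing with recurrence.")
result = qa(question="When was the Transformer introduced?", context=context)
print(result["answer"], result["score"])
```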

There are several subtasks within QA, as follows:

  • Open-Domain Question Answering (ODQA): This task involves finding an answer to a question from an open domain, such as the entire internet or a large corpus of text. The goal is to find the most relevant information to answer the question, even if it requires synthesizing information from multiple sources. Reformer is a deep learning model for ODQA.

  • Conversational Question Answering (CQA): This task involves answering questions in a conversational setting, where the model must understand the context of the conversation and generate an answer that is relevant and appropriate for the current conversational context. SDNet is a deep-learning model for conversational question answering (CQA).

  • Answer Selection: This task involves ranking a set of candidate answers for a given question, where the goal is to select the most accurate answer from the candidate set.

  • Machine Reading Comprehension (MRC): This task involves understanding and answering questions about a given passage of text. The model must be able to comprehend the text, extract relevant information, and generate an answer that is accurate and relevant to the question. XLNet is a deep-learning model used for MRC.

iii: Machine Translation:

Machine Translation (MT) is the task of automatically converting a source text in one language into a target text in another language. The goal of machine translation is to produce a fluent and accurate translation that conveys the meaning of the source text in the target language. MT models often follow an encoder–decoder architecture, as in the original formulation of the Transformer, so that a bidirectional encoder captures the source context effectively and the decoder can generate text of arbitrary length. There are several subtasks within MT (a minimal translation sketch follows the list):

  • Transliteration: It involves converting text from one script to another, such as between the Latin and Cyrillic scripts. The aim is to preserve the pronunciation of words rather than to translate their meaning into another language.

  • Unsupervised Machine Translation (UMT): It involves translating between two languages without any parallel training data, meaning that there is no corresponding text in the target language for the source language text. UMT models are typically trained on monolingual data in each language.

  • Bilingual Lexicon Induction (BLI): It involves inducing a bilingual lexicon, automatically discovering word translation pairs or mappings between two languages without the need for explicit bilingual dictionaries or parallel corpora.
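
The translation sketch mentioned above, assuming the Hugging Face Transformers library and the t5-small checkpoint, which includes English-to-German translation among its training tasks; the sentence is illustrative:

```python
from transformers import pipeline

# T5 is a general-purpose encoder-decoder model; English-to-German
# translation is one of the tasks it was trained on.
translator = pipeline("translation_en_to_de", model="t5-small")

result = translator("Machine translation converts text between languages.")
print(result[0]["translation_text"])
```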

iv: Text classification:

Text classification is the task of categorizing a text into one or more predefined categories based on its content. The goal of text classification is to automatically assign a label to a given text based on its content, allowing it to be organized and categorized for easier analysis and management.
These models are trained on annotated text data in order to learn the relationship between the text content and its label, and can then be used to classify new, unseen text data.
Text classification models typically follow an encoder-only architecture.
Its subcategories are listed below; a short zero-shot classification sketch follows the list.

  • Document Classification: This task involves assigning a label or category to a full document, such as a news article, blog post, or scientific paper. Document classification is typically accomplished by first representing the document as a numerical vector and then using a machine-learning model to make a prediction based on the document’s representation.
    LinkBERT extends the pre-training objective of BERT
    to incorporate links between documents.

  • Cause and Effect Classification: This task involves identifying the cause and effect relationship between two events described in a sentence or paragraph.
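
The classification sketch mentioned above uses the zero-shot classification pipeline from Hugging Face Transformers with an NLI-based BART checkpoint, which assigns one of several candidate labels without task-specific fine-tuning; the model choice, input text, and labels are illustrative assumptions.

```python
from transformers import pipeline

# Zero-shot classification: pick one of several candidate labels
# without task-specific fine-tuning.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

result = classifier(
    "The new graphics card delivers twice the frame rate of last year's model.",
    candidate_labels=["technology", "politics", "sports"],
)
print(result["labels"][0], result["scores"][0])
```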

v: Text Generation:

Text Generation is a task in NLP in which the objective is to produce new text automatically, typically starting from a given prompt or input. The output can be a single word, phrase, sentence, or full-length piece of text, and is used for chatbots, content creation, and more. The generated text should reflect an understanding of the input and the language being generated, and the quality and coherence of the generated text can vary depending on the approach used.
Text generation typically follows a decoder-only architecture; however, recent concerns such as prompt-injection attacks have shifted part of the focus towards instruction-tuned encoder–decoder models such as T5.
Text generation subtasks include:

  • Dialogue Generation: It focuses on generating text in the form of a conversation between two or more agents. Dialogue generation systems are used in various applications, such as chatbots, virtual assistants, and conversational AI systems. These systems use dialogue history, user input, and context to generate appropriate and coherent responses. P2-BOT is a transmitter–receiver-based framework that aims to explicitly model understanding in chat dialogue systems through mutual persona perception.

  • Code Generation: It focuses on generating code based on a given input, such as a natural language description of a software problem. Code generation systems are used in software development to automate repetitive tasks, improve productivity, and reduce errors.

  • Data-to-Text Generation: It focuses on generating natural language text from structured data such as tables, databases, or graphs. Data-to-text generation systems are used in various applications, such as news reporting, data visualization, and technical writing.

vi: Text Summarization:

Text Summarization is a task in NLP where the goal is to condense a given text into a shorter and more concise version while preserving its essential information. This is typically
accomplished by identifying and extracting the most important information, sentences, or phrases from the original text.
Text summarization is used in a variety of applications, such
as news aggregation, document summarization, and more.
Text summarization typically requires an encoder–decoder architecture to completely capture the source information.
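
A minimal abstractive summarization sketch, assuming the Hugging Face Transformers summarization pipeline with a BART checkpoint fine-tuned on CNN/DailyMail; the input article and length limits are illustrative:

```python
from transformers import pipeline

# Abstractive summarization with an encoder-decoder model.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = (
    "Transformers process entire sequences in parallel using self-attention, "
    "which lets them capture long-range dependencies far more effectively "
    "than recurrent networks. This has made them the dominant architecture "
    "for translation, summarization, and question answering."
)
print(summarizer(article, max_length=40, min_length=10)[0]["summary_text"])
```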

Its sub-categories are:

  • Extractive Summarization: It extracts the most important sentences or phrases from a document and presents them as a summary. Extractive summarization methods typically use a combination of information retrieval and natural language processing techniques to identify the most informative sentences or phrases in a document.

  • Abstractive Summarization: It generates a summary by synthesizing new information based on the input document.
    These models are trained on large amounts of data and can generate summaries that are more concise and coherent than extractive summaries. mBART is a sequence-to-sequence transformer trained on multiple large-scale monolingual corpora with a denoising objective.

  • Multi-Document Summarization: It summarizes multiple related documents into a single summary. Multi-document summarization methods typically use information retrieval techniques to identify the most important documents and natural language processing techniques to generate a summary from the selected documents.

  • Query-Focused Summarization: It summarizes a document based on a specific query or topic. Query-focused summarization methods use information retrieval techniques to identify the most relevant sentences or phrases in a document and present them as a summary.

  • Sentence Compression: It focuses on reducing the length of a sentence while preserving its meaning. Sentence compression methods typically use natural language processing techniques to identify redundant or unnecessary words or phrases in a sentence and remove them to create a more concise sentence. One approach builds on DistilRoBERTa and uses reinforcement learning to train a binary classifier that keeps or discards words to reduce sentence length.

vii: Sentiment Analysis:

Sentiment Analysis is a task in NLP with the goal of determining the sentiment expressed in a given text. This is typically accomplished by assigning a sentiment label such as positive, negative, or neutral to the text based on its contents.
The sentiment can be expressed in different forms, such as opinions, emotions, or evaluations, and can be expressed at various levels of granularity, such as at the document, sentence, or aspect level. Sentiment Analysis is used in a variety of applications, such as customer service, marketing, and opinion mining.
The quality of the sentiment analysis results can be influenced by factors such as the subjectivity of the text, the tone, and the context in which the sentiment is expressed.
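
A minimal sentiment analysis sketch, assuming the Hugging Face Transformers sentiment-analysis pipeline and its default English SST-2 model; the example texts are illustrative:

```python
from transformers import pipeline

# Defaults to an English sentiment model fine-tuned on SST-2.
sentiment = pipeline("sentiment-analysis")

for text in ["The support team resolved my issue in minutes.",
             "The update broke everything and nobody responded."]:
    result = sentiment(text)[0]
    print(result["label"], round(result["score"], 3), "-", text)
```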

viii: Named Entity Recognition:

Named Entity Recognition (NER) is a task in NLP with the goal of identifying and categorizing named entities present in a given text into predefined categories such as person names, organizations, locations, dates, and more. NER is used as an intermediate step in various applications such as question-answering, event extraction, and information retrieval.
It typically utilizes an encoder-only architecture. While the approach of fine-tuning a pre-trained model with a classification head added on top works well in practice for NER, Automated Concatenation of Embeddings (ACE) has shown improved results using an ensemble of several pre-trained models while training only a simple classifier on top using reinforcement learning.
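
A minimal NER sketch, assuming the Hugging Face Transformers token-classification ("ner") pipeline with its default CoNLL-2003-trained model; the aggregation strategy simply groups sub-word tokens into word-level entities, and the sentence is illustrative:

```python
from transformers import pipeline

# Token classification with grouped (word-level) entities.
ner = pipeline("ner", aggregation_strategy="simple")

for entity in ner("Ada Lovelace worked with Charles Babbage in London."):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```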

ix: Information Retrieval:

Information Retrieval (IR) is a task in NLP with the goal of retrieving relevant information from a large collection of documents in response to a user query. This is typically
accomplished by matching the query terms against the document content and ranking the documents based on their relevance to the query.
IR systems can be used for various applications, such as web search, document search, and question answering. The quality
of the retrieval results can be influenced by factors such as the relevance of the documents, the effectiveness of the ranking algorithm, and the representation of the documents and queries.
IR systems are typically classified further based on the level of granularity, such as document, paragraph, sentence, etc. The typical methods for retrieval include the use of a pre-trained model such as RoBERTa in a Siamese fashion to find the similarity between two embeddings.
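
A bi-encoder ("Siamese") retrieval sketch. It assumes the sentence-transformers library and the all-MiniLM-L6-v2 checkpoint as a stand-in for the RoBERTa-style bi-encoder mentioned above; the documents and query are illustrative.

```python
from sentence_transformers import SentenceTransformer, util

# Bi-encoder retrieval: embed query and documents independently,
# then rank documents by cosine similarity to the query.
model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "The Transformer relies entirely on self-attention.",
    "Bananas are rich in potassium.",
    "BERT is fine-tuned for many downstream NLP tasks.",
]
query = "Which architecture uses self-attention?"

doc_emb = model.encode(documents, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)

scores = util.cos_sim(query_emb, doc_emb)[0]
best = int(scores.argmax())
print(documents[best], float(scores[best]))
```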

  2. Multimodal Applications: Multimodal applications are software or systems that process and integrate information from multiple modalities or types of data, such as text, images, speech, and more, to provide a richer and more comprehensive user experience. These applications leverage multiple sources of data to enhance understanding, enable interaction, and solve complex problems.

i: Generative Control:

Generative Control is a task in multimodal NLP in which text is used as an interface to generate another modality, such as images or speech. The goal of Generative Control is to generate a target modality that corresponds to a given text description or instruction.
For example, based on a textual description of an object, such as "a red sports car," the task of Generative Control would be to generate an image of a red sports car.
Generative Control combines the strengths of NLP and computer graphics or speech synthesis to produce high-quality and semantically meaningful outputs in the target modality. It has
applications in areas such as computer vision, robotics, and human–computer interaction.
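
A text-to-image sketch of generative control, assuming the diffusers library, a Stable Diffusion checkpoint, and a CUDA-capable GPU; all of these are illustrative choices rather than anything prescribed by this article.

```python
import torch
from diffusers import StableDiffusionPipeline

# Text-to-image generation: the prompt acts as the control interface.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

image = pipe("a red sports car").images[0]
image.save("red_sports_car.png")
```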

ii: Description Generation:

Description generation, a subset of natural language processing (NLP), involves automatically creating human-like text descriptions based on structured data, prompts, or other information. Given an image of a scene, for example, the task would be to generate a textual description of the objects, actions, and attributes present in the scene.
It aims to generate coherent and contextually relevant text for various applications, such as product descriptions, data visualization, virtual assistants, and content generation. This process can be rule-based, template-based, or driven by machine learning models that capture complex linguistic patterns and context.
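
A minimal image-captioning sketch of description generation, assuming the Hugging Face Transformers image-to-text pipeline and a ViT-GPT2 captioning checkpoint; the image path is a placeholder.

```python
from transformers import pipeline

# Image captioning: generate a textual description of an image.
captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")

result = captioner("photo_of_a_street_scene.jpg")  # path or URL to an image
print(result[0]["generated_text"])
```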

iii: Multimodal Question Answering:

Multimodal Question Answering (QA) is a task with the goal of answering questions about a given multimodal input, such as an image or a video, using information from multiple modalities. The task involves combining information from text, images, audio,
and other modalities to accurately answer questions about the content of the input.
For example, given an image of a scene and a question about the scene, such as “What is the color of the car?”, the task of Multimodal QA would be to identify the car in the image
and answer the question with the correct color. Multimodal QA requires the integration of NLP, computer vision, and other relevant modalities. BEiT, for example, performs masked language modeling on images, texts, and image-text pairs.
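
A minimal visual question answering sketch, assuming the Hugging Face Transformers visual-question-answering pipeline and a ViLT checkpoint fine-tuned on VQA; the image path and question are placeholders.

```python
from transformers import pipeline

# Visual question answering: answer a natural-language question about an image.
vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")

result = vqa(image="photo_of_a_street_scene.jpg",
             question="What is the color of the car?")
print(result[0]["answer"], result[0]["score"])
```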
