sangjun_park

Exploring Sequence Transformation Models in NLP: An Overview of Seq2Seq, RNN, LSTM, and GRU

1. Introduction

In mathematics, a sequence transformation is an operator acting on a given space of sequences (a sequence space). Sequence transformations include linear mappings, such as convolution with another sequence, and resummation of a sequence; more generally, they are commonly used for series acceleration.

I’ll discuss four types of sequence transformation models: Seq2Seq, RNN, LSTM, and GRU, and how they are constructed.

2. Seq2Seq

The Seq2Seq model is a type of machine learning approach used in NLP. It can be applied to various tasks, including language translation, image captioning, conversational models, and text summarization. Seq2Seq works by transforming one sequence into another, enabling these diverse applications.

Encoder Example

There are three main components in the Seq2Seq model. The first one is the encoder. The encoder is responsible for processing the input sequence and capturing its essential information, which is stored as the hidden state of the network and, in a model with an attention mechanism, as a context vector. The context vector is a weighted sum of the input hidden states and is generated for every time step of the output sequence.
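As a rough sketch, an encoder along these lines might be written in PyTorch as follows; the class name, layer sizes, and choice of a GRU cell are illustrative assumptions, not something prescribed by the Seq2Seq formulation.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Reads the input sequence and returns its hidden states (illustrative sketch)."""
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, src_tokens):
        # src_tokens: (batch, src_len) integer token ids
        embedded = self.embedding(src_tokens)   # (batch, src_len, embed_dim)
        outputs, hidden = self.rnn(embedded)    # outputs: hidden state at every time step
        return outputs, hidden                  # hidden: final summary state for the decoder
```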

Decoder Example

The decoder takes the context vector and hidden states from the encoder to generate the final output sequence. It operates in an autoregressive manner, producing one element of the output sequence at a time.

At each step, it considers the previously generated elements, the context vector, and the input sequence information to predict the next element in the sequence.

In models with an attention mechanism, the context vector and hidden state are combined into an attention vector, which serves as input to the decoder.
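To make the autoregressive loop concrete, here is a matching decoder sketch under the same assumptions (GRU cell, illustrative names); at each step it takes the previously generated token and the carried-over hidden state and produces scores for the next token.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Generates one output token at a time (illustrative sketch)."""
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_token, hidden):
        # prev_token: (batch, 1) the token generated at the previous step
        # hidden: hidden state carried over from the encoder or the previous step
        embedded = self.embedding(prev_token)        # (batch, 1, embed_dim)
        output, hidden = self.rnn(embedded, hidden)  # one decoding step
        logits = self.out(output.squeeze(1))         # scores over the vocabulary
        return logits, hidden                        # hidden is fed back in at the next step
```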

Seq2Seq Model: Attention

Finally, let's look at the attention mechanism. The basic Seq2Seq architecture has a limitation: as the input sequence gets longer, the single hidden state produced by the encoder becomes less and less relevant to the decoder.

Attention enables the model to selectively focus on different parts of the input sequence during decoding. At each decoder step, an alignment model computes attention weights over the encoder's hidden vectors. The alignment model is another neural network, trained jointly with the Seq2Seq model, that calculates how well an input, represented by its hidden state, matches the previous output, represented by the attention hidden state.
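As a concrete illustration, a simple dot-product scoring function (one of several possible alignment models) can be sketched as follows; the function name and tensor shapes are assumptions made for this example.

```python
import torch
import torch.nn.functional as F

def dot_product_attention(decoder_hidden, encoder_outputs):
    # decoder_hidden: (batch, hidden_dim) current decoder state
    # encoder_outputs: (batch, src_len, hidden_dim) hidden states from the encoder
    scores = torch.bmm(encoder_outputs, decoder_hidden.unsqueeze(2)).squeeze(2)
    weights = F.softmax(scores, dim=1)               # alignment weights over input positions
    context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)
    return context, weights                          # context = weighted sum of encoder states
```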

3. RNN

Recurrent neural networks (RNNs) are a class of artificial neural networks for sequential data processing.

RNNs process data across multiple time steps, making them well-adapted for modeling and processing text, speech, and time series.

The fundamental building block of an RNN is the recurrent unit. This unit maintains a hidden state, essentially a form of memory, which is updated at each step based on the current input and the previous hidden state.

3-1. Configuration of RNN

An RNN-based model can be factored into two parts: Configuration and Architecture. Multiple RNNs can be combined in a data flow, and the data flow itself is the configuration. Each RNN itself may have any architecture, including LSTM, GRU, etc.

Standard RNN

RNNs come in many variants. A standard RNN can be defined by the following recurrence:

(y_t, h_t) = f(x_t, h_{t-1})

In other words, it is a neural network that maps an input x_t to an output y_t, with the hidden vector h_t playing the role of "memory": a partial record of all previous input-output pairs. At each step, it transforms its input into an output and updates its "memory" to help it better process future inputs.
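A minimal sketch of this recurrence in plain PyTorch might look as follows; the weight names and the tanh activation are illustrative choices, not part of the definition above.

```python
import torch

def rnn_step(x_t, h_prev, W_x, W_h, W_y, b_h, b_y):
    h_t = torch.tanh(x_t @ W_x + h_prev @ W_h + b_h)  # update the "memory"
    y_t = h_t @ W_y + b_y                             # map the memory to an output
    return y_t, h_t

# To process a sequence, the hidden state is carried forward across time steps:
#   for x_t in sequence:
#       y_t, h = rnn_step(x_t, h, W_x, W_h, W_y, b_h, b_y)
```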

Stacked RNN

A stacked RNN, or deep RNN, is composed of multiple RNNs stacked one above the other. Abstractly, it is structured as follows:

(y_t^(ℓ), h_t^(ℓ)) = f^(ℓ)(y_t^(ℓ-1), h_{t-1}^(ℓ)), with y_t^(0) = x_t

where each layer ℓ operates as a stand-alone RNN, and each layer's output sequence is used as the input sequence to the layer above. There is no conceptual limit to the depth of a stacked RNN.
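In PyTorch, for example, stacking can be expressed with the num_layers argument of the built-in RNN module; the sizes below are arbitrary and only serve to show the shapes.

```python
import torch
import torch.nn as nn

stacked_rnn = nn.RNN(input_size=32, hidden_size=64, num_layers=3, batch_first=True)
x = torch.randn(8, 20, 32)      # (batch, seq_len, input_size)
outputs, h_n = stacked_rnn(x)   # outputs: sequence from the top layer
print(h_n.shape)                # torch.Size([3, 8, 64]) -- one final hidden state per layer
```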

Bi-directional

A bi-directional RNN is composed of two RNNs, one processing the input sequence in one direction, and another in the opposite direction. Abstractly, it is structured as follows:

forward:  (y_t^f, h_t^f) = f_forward(x_t, h_{t-1}^f)
backward: (y_t^b, h_t^b) = f_backward(x_t, h_{t+1}^b)
output:   y_t = concat(y_t^f, y_t^b)

A bidirectional RNN allows the model to process a token in the context of both what came before it and what came after it. By stacking multiple bidirectional RNNs together, the model can process a token with increasingly rich context.
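As an illustration, PyTorch exposes this configuration through the bidirectional flag; the doubled output dimension below reflects the concatenation of the forward and backward states (sizes are arbitrary).

```python
import torch
import torch.nn as nn

bi_rnn = nn.RNN(input_size=32, hidden_size=64, batch_first=True, bidirectional=True)
x = torch.randn(8, 20, 32)   # (batch, seq_len, input_size)
outputs, h_n = bi_rnn(x)
print(outputs.shape)         # torch.Size([8, 20, 128]) -- forward and backward concatenated
```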

3-2. Architecture of RNN

Fully recurrent

A fully recurrent neural network (FRNN) connects the outputs of all neurons to the inputs of all neurons; in other words, it is a fully connected network. This is the most general neural network topology, because any other topology can be represented by setting some connection weights to zero to simulate the missing connections between those neurons.

Hopfield

The Hopfield network is an RNN in which all connections across layers are equally sized. It requires stationary inputs and is thus not a general RNN, as it does not process sequences of patterns. However, it is guaranteed to converge.

Elman networks and Jordan networks

An Elman network is a three-layer network (arranged horizontally as x, y, and z in the illustration) with the addition of a set of context units (u in the illustration). The middle (hidden) layer is connected to these context units with a fixed weight of one.

At each time step, the input is fed forward and a learning rule is applied. The fixed back-connections save a copy of the previous values of the hidden units in the context units (since they propagate over the connections before the learning rule is applied). Thus the network can maintain a sort of state, allowing it to perform tasks such as sequence prediction that are beyond the power of a standard multilayer perceptron.

Jordan networks are similar to Elman networks. The context units are fed from the output layer instead of the hidden layer. The context units in a Jordan network are also called the state layer. They have a recurrent connection to themselves.

Elman and Jordan networks are also known as "simple recurrent networks" (SRNs).

Elman network:

h_t = σ_h(W_h x_t + U_h h_{t-1} + b_h)
y_t = σ_y(W_y h_t + b_y)

Jordan network:

h_t = σ_h(W_h x_t + U_h y_{t-1} + b_h)
y_t = σ_y(W_y h_t + b_y)

  • x_t: input vector
  • h_t: hidden layer vector
  • y_t: output vector
  • W, U and b: parameter matrices and vectors
  • σ_h and σ_y: activation functions
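A sketch of a single Elman step following these equations might look like this; the parameter shapes and the sigmoid activations are illustrative choices.

```python
import torch

def elman_step(x_t, h_prev, W_h, U_h, b_h, W_y, b_y):
    h_t = torch.sigmoid(W_h @ x_t + U_h @ h_prev + b_h)  # context units hold h_{t-1}
    y_t = torch.sigmoid(W_y @ h_t + b_y)
    return y_t, h_t

# A Jordan step would differ only in feeding y_{t-1} (instead of h_{t-1})
# back through the recurrent weights U_h.
```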

4. LSTM


Long short-term memory (LSTM) is a type of recurrent neural network (RNN) aimed at dealing with the vanishing gradient problem present in traditional RNNs. Its relative insensitivity to gap length is its advantage over other RNNs. It aims to provide a short-term memory for RNNs that can last thousands of time steps, hence "long short-term memory". The name is an analogy to long-term memory and short-term memory and their relationship, which cognitive psychologists have studied since the early 20th century.

A common LSTM unit is composed of a cell, an input gate, an output gate, and a forget gate. The cell remembers values over arbitrary time intervals and the three gates regulate the flow of information into and out of the cell.

  • Forget gates: Decide what information to discard from the previous state by mapping the previous state and the current input to a value between 0 and 1. A value close to 1 means keep the information, and a value close to 0 means discard it.

  • Input gates: Decide which pieces of new information to store in the current cell state, using the same system as the forget gates.

  • Output gates: Control which pieces of information in the current cell state to output, by assigning a value from 0 to 1 to the information while considering the previous and current states.

Selectively outputting relevant information from the current state allows the LSTM network to maintain useful long-term dependencies for making predictions, both in current and future time steps.
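To make the gates concrete, here is a sketch of a single LSTM step; the weight names are illustrative, and in practice you would normally use a built-in implementation such as torch.nn.LSTM.

```python
import torch

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # W, U, b: dicts of parameters for the gates "f", "i", "o" and the candidate "c"
    f_t = torch.sigmoid(x_t @ W["f"] + h_prev @ U["f"] + b["f"])  # forget gate
    i_t = torch.sigmoid(x_t @ W["i"] + h_prev @ U["i"] + b["i"])  # input gate
    o_t = torch.sigmoid(x_t @ W["o"] + h_prev @ U["o"] + b["o"])  # output gate
    c_cand = torch.tanh(x_t @ W["c"] + h_prev @ U["c"] + b["c"])  # candidate cell state
    c_t = f_t * c_prev + i_t * c_cand   # discard part of the old memory, add new information
    h_t = o_t * torch.tanh(c_t)         # expose part of the cell state as the output
    return h_t, c_t
```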

5. GRU


The gated recurrent unit (GRU) was designed as a simplification of the LSTM. GRUs are used in their full form and in several further simplified variants. They have fewer parameters than LSTMs, as they lack an output gate.
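As a quick illustration with PyTorch's built-in module, the GRU interface mirrors the LSTM's but returns no separate cell state; the sizes below are arbitrary.

```python
import torch
import torch.nn as nn

gru = nn.GRU(input_size=32, hidden_size=64, batch_first=True)
x = torch.randn(8, 20, 32)        # (batch, seq_len, features)
outputs, h_n = gru(x)             # only hidden states, no cell state to carry around
print(outputs.shape, h_n.shape)   # torch.Size([8, 20, 64]) torch.Size([1, 8, 64])
```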

6. Conclusion

In this article, we explored four different types of sequence transformation models: Seq2Seq, RNN, LSTM, and GRU. Each of these models has its unique strengths and applications in the field of natural language processing (NLP) and other domains involving sequential data.

Choosing the right model depends on the specific task at hand. For tasks requiring long-term dependencies, LSTMs and GRUs may be more appropriate, while Seq2Seq models excel in tasks that involve mapping one sequence to another, particularly when enhanced with attention mechanisms.
