DEV Community

gopal gupta

LSTM, GRU and a little Attention

This article primarily focuses on visualizing two of the most popular recurrent neural network (RNN) architectures: long short-term memory (LSTM) and gated recurrent units (GRU). It also covers the intuition behind the attention mechanism.
A common LSTM unit is composed of a cell, an input gate, an output gate, and a forget gate. The cell remembers values over arbitrary time intervals, and the three gates regulate the flow of information into and out of the cell. The forget gate decides how much of the previous cell state to retain, discarding information from earlier steps that is no longer needed while preserving the long-term information that is. The input gate controls the extent to which a new value flows into the cell, and the output gate controls the extent to which the value in the cell is used to compute the output activation of the LSTM unit. Here is a visual representation of an LSTM:

*(Image: LSTM unit with input, output, and forget gates)*
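The gate mechanics above can be written out directly. Here is a minimal single-step LSTM in pure NumPy, just to make the equations concrete; the function name and the convention of stacking the four gate parameter blocks are my own choices for the sketch, not any framework's API:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b hold the parameters of the forget (f),
    input (i), output (o) gates and the candidate cell (g), stacked
    along the first axis into a (4*hidden, ...) block."""
    hidden = h_prev.shape[0]
    z = W @ x + U @ h_prev + b               # all four pre-activations at once
    f = sigmoid(z[0 * hidden:1 * hidden])    # forget gate: how much of c_prev to keep
    i = sigmoid(z[1 * hidden:2 * hidden])    # input gate: how much new info enters
    o = sigmoid(z[2 * hidden:3 * hidden])    # output gate: how much of the cell is exposed
    g = np.tanh(z[3 * hidden:4 * hidden])    # candidate cell content
    c = f * c_prev + i * g                   # new cell state (long-term memory)
    h = o * np.tanh(c)                       # new hidden state (output activation)
    return h, c
```

To process a sequence you would simply loop `h, c = lstm_step(x_t, h, c, W, U, b)` over the time steps, carrying `h` and `c` forward.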

The GRU is like a long short-term memory (LSTM) with a forget gate, but it has fewer parameters, as it lacks an output gate. It combines the forget and input gates into a single "update gate" and merges the cell state and hidden state. The update gate helps the model determine how much of the past information (from previous time steps) needs to be passed along to the future, while the reset gate is used by the model to decide how much of the past information to forget. Update gates help capture long-term dependencies in sequences; reset gates help capture short-term dependencies.
*(Image: GRU unit with update and reset gates)*
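For comparison with the LSTM, here is the same kind of single-step sketch for a GRU in NumPy. Note there is only one state vector (hidden and cell merged) and three parameter blocks instead of four; again, the names and parameter layout are illustrative assumptions, not a library interface:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, W, U, b):
    """One GRU time step. Parameters for the update (z) and reset (r)
    gates and the candidate state (n) are stacked into (3*hidden, ...) blocks."""
    Wz, Wr, Wn = np.split(W, 3)
    Uz, Ur, Un = np.split(U, 3)
    bz, br, bn = np.split(b, 3)
    z = sigmoid(Wz @ x + Uz @ h_prev + bz)        # update gate: how much past to carry forward
    r = sigmoid(Wr @ x + Ur @ h_prev + br)        # reset gate: how much past to forget
    n = np.tanh(Wn @ x + Un @ (r * h_prev) + bn)  # candidate state, built from a "reset" past
    h = (1 - z) * n + z * h_prev                  # single merged hidden/cell state
    return h
```

The last line shows the update gate doing the work of both the LSTM's forget and input gates: one scalar per unit interpolates between keeping the old state and writing the new candidate.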

RNN with Attention Mechanism
In the encoder-decoder architecture, the encoder processes the input sequence and encodes/compresses/summarizes the information into a context vector of fixed length. The decoder is then initialized with this context vector, from which it starts generating the transformed output. The central idea behind attention is not to throw away the intermediate encoder states but to utilize all of them to construct the context vectors the decoder needs to generate the output sequence.
Imagine you are translating "How are you" to Hindi: "AAP KAISE HO". When predicting "AAP", you would give more weight to "you". Similarly, when predicting "KAISE", you would give more weight to "you" and "AAP", and for "HO" you would give more weight to "are" and "AAP KAISE". This is the intuition behind the attention mechanism. In attention models, the context vector has access to the entire input sequence; it is built from the encoder hidden states, the decoder hidden state, and the alignment between the source and the target. Below is the neural network representation of attention:
*(Image: neural network representation of the attention mechanism)*

Our attention model has a single RNN encoder with 3 time steps. We denote the encoder's input vectors by x_1, x_2, x_3 and its output vectors by h_1, h_2, h_3. The attention mechanism sits between the encoder and the decoder. The decoder generates the next word in the sequence, and along with this output it also produces an internal hidden state, which acts as input to the next decoder step.
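The alignment described above can be sketched numerically. The snippet below uses simple dot-product scoring (one common choice among several, and an assumption here, since the post does not pin down a scoring function) over three encoder states h_1, h_2, h_3 to build the context vector for one decoder step:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a score vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(decoder_state, encoder_states):
    """Dot-product attention for one decoder step: score each encoder
    state against the current decoder state, normalize the scores into
    attention weights, and return the weighted sum as the context vector."""
    scores = encoder_states @ decoder_state  # one alignment score per encoder time step
    weights = softmax(scores)                # attention weights, non-negative, sum to 1
    context = weights @ encoder_states       # context vector: weighted mix of h_1..h_3
    return context, weights

# Three encoder output states h_1, h_2, h_3 (rows), toy values:
H = np.array([[1.0, 0.0],    # h_1
              [0.0, 1.0],    # h_2
              [1.0, 1.0]])   # h_3
s = np.array([0.0, 2.0])     # current decoder hidden state
context, weights = attend(s, H)
```

Here the decoder state `s` points mostly along the second dimension, so the weights concentrate on h_2 and h_3, and the context vector leans toward them; at the next decoder step a new hidden state would produce a new set of weights, which is exactly the "re-weighting per output word" intuition from the translation example.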
