DEV Community

Rimmel Asghar
Rimmel Asghar

Posted on • Updated on

Recurrent Neural Networks (RNNs) - Easily Explained 🚀

main img

The main objective of this post is to provide an easy explanation as well to make it useful for the readers.

What is an RNN?

A recurrent neural network is a neural network that is specialized for processing a sequence of data x(t)= x(1), . . . , x(τ) with the time step index t ranging from 1 to τ. For tasks that involve sequential inputs, such as speech and language, it is often better to use RNNs. In a NLP problem, if you want to predict the next word in a sentence it is important to know the words before it. RNNs are called recurrent because they perform the same task for every element of a sequence, with the output being depended on the previous computations. Another way to think about RNNs is that they have a “memory” which captures information about what has been calculated so far.

Architecture : Let us briefly go through a basic RNN network.


The left side of the above diagram shows a notation of an RNN and on the right side an RNN being unrolled (or unfolded) into a full network. By unrolling we mean that we write out the network for the complete sequence. For example, if the sequence we care about is a sentence of 3 words, the network would be unrolled into a 3-layer neural network, one layer for each word.

Input: x(t)​ is taken as the input to the network at time step t. For example, x1,could be a one-hot vector corresponding to a word of a sentence.

Hidden state: h(t)​ represents a hidden state at time t and acts as “memory” of the network. h(t)​ is calculated based on the current input and the previous time step’s hidden state: h(t)​ = f(U x(t)​ + W h(t−1)​). The function f is taken to be a non-linear transformation such as tanh, ReLU.

Weights: The RNN has input to hidden connections parameterized by a weight matrix U, hidden-to-hidden recurrent connections parameterized by a weight matrix W, and hidden-to-output connections parameterized by a weight matrix V and all these weights (U,V,W) are shared across time.

Output: o(t)​ illustrates the output of the network. In the figure I just put an arrow after o(t) which is also often subjected to non-linearity, especially when the network contains further layers downstream.

Forward Pass
The figure does not specify the choice of activation function for the hidden units.
Before we proceed we make few assumptions:
1) we assume the hyperbolic tangent activation function for hidden layer.
2) We assume that the output is discrete, as if the RNN is used to predict words or characters.
A natural way to represent discrete variables is to regard the output o as giving the un-normalized log probabilities of each possible value of the discrete variable. We can then apply the softmax operation as a post-processing step to obtain a vector ŷof normalized probabilities over the output.

The RNN forward pass can thus be represented by below set of equations.


This is an example of a recurrent network that maps an input sequence to an output sequence of the same length. The total loss for a given sequence of x values paired with a sequence of y values would then be just the sum of the losses over all the time steps. We assume that the outputs o(t)are used as the argument to the softmax function to obtain the vector ŷ of probabilities over the output. We also assume that the loss L is the negative log-likelihood of the true target y(t)given the input so far.

Backward Pass
The gradient computation involves performing a forward propagation pass moving left to right through the graph shown above followed by a backward propagation pass moving right to left through the graph. The runtime is O(τ) and cannot be reduced by parallelization because the forward propagation graph is inherently sequential; each time step may be computed only after the previous one. States computed in the forward pass must be stored until they are reused during the backward pass, so the memory cost is also O(τ). The back-propagation algorithm applied to the unrolled graph with O(τ) cost is called back-propagation through time (BPTT). Because the parameters are shared by all time steps in the network, the gradient at each output depends not only on the calculations of the current time step, but also the previous time steps.

Computing Gradients

Given our loss function L, we need to calculate the gradients for our three weight matrices U, V, W, and bias terms b, c and update them with a learning rate α. Similar to normal back-propagation, the gradient gives us a sense of how the loss is changing with respect to each weight parameter. We update the weights W to minimize loss with the following equation:


The same is to be done for the other weights U, V, b, c as well.

Let us now compute the gradients by BPTT for the RNN equations above. The nodes of our computational graph include the parameters U, V, W, b and c as well as the sequence of nodes indexed by t for x (t), h(t), o(t) and L(t). For each node n we need to compute the gradient ∇nL recursively, based on the gradient computed at nodes that follow it in the graph.

Gradient with respect to output o(t) is calculated assuming the o(t) are used as the argument to the softmax function to obtain the vector ŷ of probabilities over the output. We also assume that the loss is the negative log-likelihood of the true target y(t).


Let us now understand how the gradient flows through hidden state h(t). This we can clearly see from the below diagram that at time t, hidden state h(t) has gradient flowing from both current output and the next hidden state.


We work our way backward, starting from the end of the sequence. At the final time step τ, h(τ) only has o(τ) as a descendant, so its gradient is simple:


We can then iterate backward in time to back-propagate gradients through time, from t=τ −1 down to t = 1, noting that h(t) (for t < τ ) has as descendants both o(t) and h(t+1). Its gradient is thus given by:


Once the gradients on the internal nodes of the computational graph are obtained, we can obtain the gradients on the parameter nodes. The gradient calculations using the chain rule for all parameters is:


We are not interested to derive these equations here, rather implementing these. There are very good posts links below providing detailed derivation of these equations.

Hope it was useful for you.Thanks for the read.

Top comments (0)