gopal gupta

Attention is all you need!

In this post we will focus on how the Transformer is used for sequence-to-sequence learning, based on the paper Attention Is All You Need. We will be referring to a PyTorch implementation of the discussed model, which can be found in this github link.
One of the biggest issues with Recurrent Neural Networks (RNNs) is their sequential computation. For the sentence "I was at the bank", to process the word "bank" the model first has to go through "I", then "was", then "at", then "the", and only then "bank". This poses two challenges:-

Loss of Context:

For example, it is harder to keep track of whether the subject is singular or plural as the model moves further away from the subject, or to capture that a word means something different depending on context. For example, "bank" means something different in each of the sentences below.
I was at the river bank → bank = [0.9, 0.1, 0.6]
I was at the bank → bank = [0.9, 0.1, 0.6]
In this example, the embedding for "bank" is exactly the same in both sentences, even though we know the word means different things: the river bank is literally the edge of the river, where we see the sand and interact with the water, while the other bank is where we go to take out money and do our transactions (the short code sketch after the list below makes this concrete). We can get better values for these embeddings by:

  • Adding an attention mechanism
  • Working with convolutional networks
  • Adding positional embeddings, etc.
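
To make the issue from the two "bank" sentences concrete, here is a minimal sketch (the toy vocabulary and dimensions are made up purely for illustration) showing that a plain embedding layer assigns "bank" exactly the same vector in both sentences:

```python
import torch
import torch.nn as nn

# Hypothetical toy vocabulary; indices are made up purely for illustration.
vocab = {"i": 0, "was": 1, "at": 2, "the": 3, "river": 4, "bank": 5}

torch.manual_seed(0)
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=3)

sent_a = torch.tensor([vocab[w] for w in "i was at the river bank".split()])
sent_b = torch.tensor([vocab[w] for w in "i was at the bank".split()])

# A static embedding layer just looks up the same row for "bank" in both sentences,
# so the two vectors are identical even though the meanings differ.
print(torch.equal(embedding(sent_a)[-1], embedding(sent_b)[-1]))  # True
```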

The Transformer takes this concept to the next level and uses multi-head attention, where each head acts like a channel in a convolutional network and uses its own linear transformations to represent words. With multi-head attention, each head can learn different relationships between words than another head.

Vanishing Gradient:

The second challenge posed by RNN-like models: during back-propagation, the gradients can become really small, and as a result the model will not learn much.

The architecture described in the paper is shown in the picture below and contains the following flow:-

src -> Encoder -> Encoder Layers -> Self_attention (Multi-head attention) -> Feedforward Network -> Decoder -> Decoder Layers -> Masked_self_attention (Masked multi-head attention) -> Encoder_decoder_attention (Multi-head attention) -> Feedforward Network -> Linear Layer -> Output Probabilities

[Figure: Transformer architecture from the paper]
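
At a very high level, this flow can be sketched with PyTorch's built-in nn.Transformer module. This is only a rough stand-in for illustration - the hyperparameters below are made up and positional encodings are omitted - not the implementation from the linked repo:

```python
import torch
import torch.nn as nn

# Illustrative sizes only (the paper's base model uses d_model=512, 8 heads, 6 layers).
d_model, n_heads, n_layers, vocab_size = 512, 8, 6, 10000

src_embed = nn.Embedding(vocab_size, d_model)        # source token embeddings
trg_embed = nn.Embedding(vocab_size, d_model)        # target token embeddings
transformer = nn.Transformer(d_model=d_model, nhead=n_heads,
                             num_encoder_layers=n_layers,
                             num_decoder_layers=n_layers,
                             batch_first=True)
generator = nn.Linear(d_model, vocab_size)           # final linear layer before softmax

src = torch.randint(0, vocab_size, (1, 12))          # [batch, src_len]
trg = torch.randint(0, vocab_size, (1, 10))          # [batch, trg_len]

# encoder -> decoder -> linear layer -> probabilities over the target vocabulary
# (positional encodings and masks are omitted here for brevity)
out = transformer(src_embed(src), trg_embed(trg))    # [batch, trg_len, d_model]
probs = generator(out).softmax(dim=-1)               # [batch, trg_len, vocab_size]
```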
Let's look at the individual components in detail:-

Encoder

The encoder produces a sequence of context vectors, each of which has seen all tokens at all positions within the input sequence. This is different from a traditional RNN, which has only seen the tokens before the current one. Tokens are passed through a standard embedding layer, and token positions in the sequence are passed through a positional embedding layer. Token embeddings are multiplied by a scaling factor sqrt(d_model), where d_model is the hidden dimension size (this reduces variance). Token and positional embeddings are element-wise summed together to get a vector, and then we apply dropout to the combined embeddings. Finally, the combined embeddings are passed through the encoder layers.
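
A minimal sketch of that embedding step, assuming a learned positional embedding (module and variable names here are my own, not necessarily those of the linked repo):

```python
import torch
import torch.nn as nn

class EncoderEmbedding(nn.Module):
    """Scaled token embedding plus learned positional embedding, followed by dropout."""
    def __init__(self, vocab_size, d_model, max_len=100, dropout=0.1):
        super().__init__()
        self.tok_embedding = nn.Embedding(vocab_size, d_model)
        self.pos_embedding = nn.Embedding(max_len, d_model)
        self.dropout = nn.Dropout(dropout)
        self.scale = d_model ** 0.5                      # sqrt(d_model) scaling factor

    def forward(self, src):                              # src: [batch, src_len]
        batch_size, src_len = src.shape
        pos = torch.arange(src_len, device=src.device).unsqueeze(0).repeat(batch_size, 1)
        # scale token embeddings, add positional embeddings element-wise, apply dropout
        return self.dropout(self.tok_embedding(src) * self.scale + self.pos_embedding(pos))
```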

Encoder Layer:
We pass the source sentence and its mask into the multi-head attention layer and apply dropout. Then we apply a residual connection and pass the result through a layer normalization layer. Next we pass it through a feedforward layer and again apply dropout. After this, we apply a residual connection and then layer normalization to get the output.
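
As a rough sketch (assuming the MultiHeadAttention module shown in the next section, and with the position-wise feed-forward block inlined), one encoder layer could look like this:

```python
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model, n_heads, ff_dim, dropout=0.1):
        super().__init__()
        self.self_attention = MultiHeadAttention(d_model, n_heads)  # sketched below
        self.feedforward = nn.Sequential(                           # position-wise feed-forward
            nn.Linear(d_model, ff_dim), nn.ReLU(), nn.Linear(ff_dim, d_model))
        self.attn_norm = nn.LayerNorm(d_model)
        self.ff_norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src, src_mask):
        # multi-head self-attention over the source, dropout, residual connection, layer norm
        attn_out, _ = self.self_attention(src, src, src, src_mask)
        src = self.attn_norm(src + self.dropout(attn_out))
        # feed-forward, dropout, residual connection, layer norm
        src = self.ff_norm(src + self.dropout(self.feedforward(src)))
        return src
```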

[Figure: Encoder layer]

Multi Head Attention Layer

One of the key components is the multi-head attention layer.
Attention can be thought of in terms of queries, keys and values - the query is used with the key to get an attention vector, which is then used to weight the values. In multi-head attention, instead of doing a single attention application, the queries, keys and values have their hid_dim split into n_heads, and the scaled dot-product attention is calculated over all heads in parallel. This means that instead of paying attention to one concept per attention application, the model pays attention to n concepts, one per head. Each head uses a different linear transformation to represent words, so a head can learn different relationships between words than another head.
Let's say we have an embedding for a word. We first derive the Query (Q), Key (K) and Value (V) with linear layers and then split the hid_dim of the query, key and value into n_heads * head_dim. We multiply Q and K together and divide by a scale factor to get the energy. We still need to calculate how much attention to pay to each word, so we apply the softmax to these energy vectors to get the attention weights. Then the attention weights are multiplied by the Value (V). This gives a new representation of the word for each head. In the case of multiple heads, we concatenate them and then multiply by a final matrix of dimension (n_heads * head_dim) x d_model to get one final vector corresponding to each word.
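
A condensed sketch of that computation (variable names are mine and may differ from the linked repo):

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.head_dim = n_heads, d_model // n_heads
        self.fc_q = nn.Linear(d_model, d_model)          # linear layers to derive Q, K, V
        self.fc_k = nn.Linear(d_model, d_model)
        self.fc_v = nn.Linear(d_model, d_model)
        self.fc_o = nn.Linear(d_model, d_model)          # final projection after concatenation
        self.scale = self.head_dim ** 0.5

    def forward(self, query, key, value, mask=None):
        batch_size = query.shape[0]
        # project, then split hid_dim into n_heads * head_dim: [batch, n_heads, len, head_dim]
        Q = self.fc_q(query).view(batch_size, -1, self.n_heads, self.head_dim).transpose(1, 2)
        K = self.fc_k(key).view(batch_size, -1, self.n_heads, self.head_dim).transpose(1, 2)
        V = self.fc_v(value).view(batch_size, -1, self.n_heads, self.head_dim).transpose(1, 2)
        # scaled dot-product: energy = Q K^T / sqrt(head_dim)
        energy = torch.matmul(Q, K.transpose(-2, -1)) / self.scale
        if mask is not None:
            energy = energy.masked_fill(mask == 0, -1e10)   # block masked positions
        attention = torch.softmax(energy, dim=-1)           # how much of each word to use
        x = torch.matmul(attention, V)                       # weighted sum of the values
        # concatenate the heads and apply the final linear layer
        x = x.transpose(1, 2).contiguous().view(batch_size, -1, self.n_heads * self.head_dim)
        return self.fc_o(x), attention
```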
[Figure: Multi-head attention]

Decoder

The objective of the decoder is to take the encoded representation of the source sentence and convert it into predicted tokens in the target sentence. We then compare the predicted tokens with the actual tokens in the target sentence to calculate the loss, which is used to calculate the gradients of the parameters, and an optimizer then updates the weights in order to improve the predictions.
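A rough sketch of that loss step (the padding index, tensor shapes and target shifting are assumptions here, not taken from the linked repo):

```python
import torch.nn as nn

PAD_IDX = 1  # hypothetical padding index
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)

def compute_loss(output, trg):
    # output: [batch, trg_len, vocab_size] from the decoder's final linear layer
    # trg:    [batch, trg_len] token indices (usually shifted so we predict the next token)
    return criterion(output.reshape(-1, output.shape[-1]), trg.reshape(-1))

# loss.backward() then computes the gradients and optimizer.step() updates the weights
```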
The decoder has two multi-head attention layers: a masked multi-head attention layer over the target sequence, and a multi-head attention layer which uses the decoder representation as the query and the encoder representation as the key and value.
[Figure: Decoder]

The decoder uses positional embeddings and combines them - via an element-wise sum - with the scaled embedded target tokens, followed by dropout. The combined embeddings are then passed through the N decoder layers, along with the encoded source, enc_src, and the source and target masks. The decoder representation after the Nth layer is then passed through a linear layer. The source mask is used to prevent the model from attending to padding tokens. The target mask is needed because we process all of the target tokens at once in parallel; it stops the decoder from "cheating" by simply "looking" at what the next token in the target sequence is and outputting it.
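
The two masks can be sketched roughly like this (padding index assumed; the linked repo's exact implementation may differ):

```python
import torch

PAD_IDX = 1  # hypothetical padding index

def make_src_mask(src):                       # src: [batch, src_len]
    # hide padding tokens: True where the token is real
    return (src != PAD_IDX).unsqueeze(1).unsqueeze(2)          # [batch, 1, 1, src_len]

def make_trg_mask(trg):                       # trg: [batch, trg_len]
    trg_len = trg.shape[1]
    pad_mask = (trg != PAD_IDX).unsqueeze(1).unsqueeze(2)      # [batch, 1, 1, trg_len]
    # lower-triangular "subsequent" mask: position i may only attend to positions <= i
    sub_mask = torch.tril(torch.ones(trg_len, trg_len, dtype=torch.bool, device=trg.device))
    return pad_mask & sub_mask                                  # [batch, 1, trg_len, trg_len]
```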
A sample output of the model is shown below:-
[Figure: sample model output]

This model trains very fast, since the whole sequence can be processed in parallel rather than token by token, and it has applications in many areas of NLP such as text summarization, auto-completion, chatbots, translation, etc. GPT-2, BERT, and T5 are some well-known transformer models.
