
gopal gupta


Convolutional Sequence to Sequence Learning

In this post we will focus on how Convolutional Neural Networks (CNNs) are used for sequence-to-sequence learning, based on the paper Convolutional Sequence to Sequence Learning. Traditionally, CNNs have primarily been used to solve computer vision problems, but recently they have found a use case in machine translation. The major advantages of using CNNs for NLP are faster training and the ability to capture complex relationships between words separated by different distances. In this blog we will refer to the PyTorch implementation; the discussed code can be found in this GitHub link.
Let's consider the traditional Seq2Seq model in the picture below. We use our encoder (green) over the embedded source sequence (yellow) to create a context vector (red). We then use that context vector with the decoder (blue) and a linear layer (purple) to generate the target sentence.
[Figure: a traditional Seq2Seq model]

The architecture described in the paper is shown in the picture below. It differs from the traditional Seq2Seq model in several ways, which we will focus on in this article.
[Figure: the convolutional Seq2Seq architecture from the paper]
In this architecture we have three building blocks:
i) Encoder
ii) Decoder
iii) Attention

Encoder

The encoder's role is to encode the input sentence, which is in the source language, into a context vector. How it does this is different from the recurrent model.

Positional Embedding

Since the whole input is given to the encoder at once, the encoder needs some way of knowing where each word actually comes from, i.e. the position of every word in the sentence. Hence we need a position vector along with the word vector. The positions go through an embedding layer, and the resulting positional embedding is summed element-wise with the word embedding. This resultant vector goes to a fully connected layer, whose output is fed to the convolution block; after that we have another fully connected layer, and finally we have a residual connection. A minimal sketch of the embedding sum is shown below.
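
For intuition, here is a minimal sketch of the token-plus-position embedding sum, with illustrative sizes that are not taken from the post (the full encoder code appears later in this article):

import torch
import torch.nn as nn

# illustrative sizes only
vocab_size, max_length, emb_dim = 100, 10, 8

tok_embedding = nn.Embedding(vocab_size, emb_dim)
pos_embedding = nn.Embedding(max_length, emb_dim)

src = torch.tensor([[5, 42, 7, 13]])               # [batch size = 1, src len = 4]
pos = torch.arange(0, src.shape[1]).unsqueeze(0)   # [[0, 1, 2, 3]]

# element-wise sum of word embedding and position embedding
embedded = tok_embedding(src) + pos_embedding(pos) # [1, 4, emb dim]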

Why is this called a residual connection?

Similar to the residual path in ResNet, the output of the GLU activation is summed element-wise with the same vector from before it was passed through the convolution layer.
[Figure: residual connection around the convolution and GLU block]

Convolution Block

In convolution block, first step is to pad the input. This is done because convolutional layer filter is going to reduce the length of the input sentence and we want to make sure the length of the sentence coming from the convolution is equal to length of the convolution block going out. The output of convolving at each group of words at layer 1 pad is presented to GLU activation.
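
As a quick sanity check (with illustrative channel counts, not values from the post), a kernel of size 3 shrinks a 5-token sequence to 3 positions unless we pad by (kernel_size - 1) // 2 on each side:

import torch
import torch.nn as nn

x = torch.randn(1, 64, 5)                                    # [batch, channels, src len]

no_pad = nn.Conv1d(64, 64, kernel_size = 3)                  # no padding
same_pad = nn.Conv1d(64, 64, kernel_size = 3, padding = 1)   # padding = (3 - 1) // 2

print(no_pad(x).shape)    # torch.Size([1, 64, 3]) -> sequence length shrinks
print(same_pad(x).shape)  # torch.Size([1, 64, 5]) -> sequence length preserved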

GLU Function

  • In the GLU function, the input is split into two halves along the channel dimension; one half is passed through a sigmoid gate and multiplied element-wise with the other half. This means that if two channels come in, only one comes out, which poses a challenge: it halves the number of channels the layer receives. Hence the convolution in each block should double the number of output channels, so that after the GLU activation the size is again equal to the input size of the layer.

  • Further, the model adds a residual connection from the input of layer 1 to the output of the GLU activation. This allows the network to cope with gradient problems if they arise in the convolutional stack. In effect, the positional and word embedding information is added back in, and that sum becomes the residual-connection result. In the next layer we do exactly the same thing: we pad the input, send it through a convolution block whose output goes through the GLU (which halves the dimension, so the convolution must double it), and then add back whatever input went into the convolution, giving us the residual connection again (see the sketch after this list).

The residual connection then gives this input to the next layer. We describe this as N convolution blocks, since the block can repeat itself N times internally. We have to make sure the input and output dimensions of each block are the same, because this dimension is the hidden dimension.
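
Here is a minimal sketch of one such block (illustrative dimensions, not taken from the post): the convolution doubles the channels, the GLU halves them back, and the residual sum is scaled by sqrt(0.5), matching the encoder code shown later:

import torch
import torch.nn as nn
import torch.nn.functional as F

batch_size, hid_dim, src_len, kernel_size = 3, 512, 5, 3

conv = nn.Conv1d(in_channels = hid_dim,
                 out_channels = 2 * hid_dim,
                 kernel_size = kernel_size,
                 padding = (kernel_size - 1) // 2)
scale = torch.sqrt(torch.FloatTensor([0.5]))

conv_input = torch.randn(batch_size, hid_dim, src_len)

conved = conv(conv_input)               # [3, 1024, 5] -> channels doubled
conved = F.glu(conved, dim = 1)         # [3, 512, 5]  -> GLU halves the channels
conved = (conved + conv_input) * scale  # residual connection, scaled by sqrt(0.5)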
[Figure: the encoder's convolutional blocks]

Below is the explanation of the encoder code:

import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, 
                 input_dim, 
                 emb_dim, 
                 hid_dim, 
                 n_layers, 
                 kernel_size, 
                 dropout, 
                 device,
                 max_length = 100):
        super().__init__()

        assert kernel_size % 2 == 1, "Kernel size must be odd!"

        self.device = device

        self.scale = torch.sqrt(torch.FloatTensor([0.5])).to(device)

        self.tok_embedding = nn.Embedding(input_dim, emb_dim)
        self.pos_embedding = nn.Embedding(max_length, emb_dim)

        self.emb2hid = nn.Linear(emb_dim, hid_dim)
        self.hid2emb = nn.Linear(hid_dim, emb_dim)

        self.convs = nn.ModuleList([nn.Conv1d(in_channels = hid_dim, 
                                              out_channels = 2 * hid_dim, 
                                              kernel_size = kernel_size, 
                                              padding = (kernel_size - 1) // 2)
                                    for _ in range(n_layers)])

        self.dropout = nn.Dropout(dropout)
  • tok_embedding is the word embedding, which requires the input dimension and the embedding dimension.
  • The position embedding requires the maximum length of the input and the embedding dimension. We need a max_length because we can't handle sequences of arbitrary length; it is the capacity of the network. So if we only ever look at sentences of up to 100 words, max_length is 100. The positional embedding dimension must also match the token embedding dimension, because we will soon be summing the two.
  • We also have two feedforward (linear) layers, namely emb2hid and hid2emb.
  • Between these feedforward layers we add the convolution blocks, named convs. In code, the convolution blocks are stored in an nn.ModuleList, which holds the Conv1d layers in an indexable list. We simply loop over the number of layers we want, and the list takes care of chaining the first convolution block to the second, the second to the third, and so on; that is why we call self.convs = nn.ModuleList(...).
  • We pass the appropriate padding, kernel size, and output size as arguments, and use 1D convolutions. Padding is chosen so that the centre of the kernel can sit over the first word: a kernel size of 3 needs a padding of 1, a kernel size of 5 needs a padding of 2, and a kernel size of 7 needs a padding of 3 on both sides. This is exactly (kernel_size - 1) // 2 (integer division). We also apply dropout, represented by nn.Dropout. A hedged instantiation example follows this list.
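
As an illustration, here is one way the encoder might be instantiated; the vocabulary size and hyperparameter values below are assumptions for the sake of the example, not taken from the post:

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# illustrative hyperparameters only
enc = Encoder(input_dim = 10000,   # vocabulary size
              emb_dim = 256,
              hid_dim = 512,
              n_layers = 10,
              kernel_size = 3,
              dropout = 0.25,
              device = device).to(device)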

In the forward function:

    def forward(self, src):

        #src = [batch size, src len]

        batch_size = src.shape[0]
        src_len = src.shape[1]

        #create position tensor
        pos = torch.arange(0, src_len).unsqueeze(0).repeat(batch_size, 1).to(self.device)

        #pos = [0, 1, 2, 3, ..., src len - 1]

        #pos = [batch size, src len]

        #embed tokens and positions
        tok_embedded = self.tok_embedding(src)
        pos_embedded = self.pos_embedding(pos)

        #tok_embedded = pos_embedded = [batch size, src len, emb dim]

        #combine embeddings by elementwise summing
        embedded = self.dropout(tok_embedded + pos_embedded)

        #embedded = [batch size, src len, emb dim]

        #pass embedded through linear layer to convert from emb dim to hid dim
        conv_input = self.emb2hid(embedded)

        #conv_input = [batch size, src len, hid dim]

        #permute for convolutional layer
        conv_input = conv_input.permute(0, 2, 1) 

        #conv_input = [batch size, hid dim, src len]

        #begin convolutional blocks...

        for i, conv in enumerate(self.convs):

            #pass through convolutional layer
            conved = conv(self.dropout(conv_input))

            #conved = [batch size, 2 * hid dim, src len]

            #pass through GLU activation function
            conved = F.glu(conved, dim = 1)

            #conved = [batch size, hid dim, src len]

            #apply residual connection
            conved = (conved + conv_input) * self.scale

            #conved = [batch size, hid dim, src len]

            #set conv_input to conved for next loop iteration
            conv_input = conved

        #...end convolutional blocks

        #permute and convert back to emb dim
        conved = self.hid2emb(conved.permute(0, 2, 1))

        #conved = [batch size, src len, emb dim]

        #elementwise sum output (conved) and input (embedded) to be used for attention
        combined = (conved + embedded) * self.scale

        #combined = [batch size, src len, emb dim]

        return conved, combined
  • self.tok_embedding carries the source token information, and self.pos_embedding contains the positional information. Here is how we calculate the pos tensor that is fed to self.pos_embedding: a) the batch size can be read from the zeroth dimension of src; b) src_len can be read from the first dimension of src; c) assume a src_len of 5 and a batch_size of 3; for the pos tensor we need the range 0..src_len - 1 repeated batch_size times, so the positional input is provided per example. This way, when a batch of sentences comes in, we can provide positions for every sentence in the batch. For example:

>>> torch.arange(0, 5)
tensor([0, 1, 2, 3, 4])

>>> torch.arange(0, 5).unsqueeze(0).repeat(3, 1)
tensor([[0, 1, 2, 3, 4],
        [0, 1, 2, 3, 4],
        [0, 1, 2, 3, 4]])

The position tensor needs to live on the same device as the model (e.g. the GPU), which is why the encoder also needs the device information. In the next step we combine the embeddings by element-wise summing and pass the result through the dropout function, which helps with regularization during training.

  • In the next step we pass the embedded tensor through a linear layer to convert it from emb_dim to hid_dim. Note the shape of the result: conv_input = [batch size, src len, hid dim].

As discussed earlier, the convolutional layer needs its input in the format conv_input = [batch size, hid dim, src len]. To achieve this, we use permute(0, 2, 1).

  • We enumerate the self.convs module list to traverse each convolutional layer in turn and pass the input through it, applying dropout to each layer's input as a regularization technique.
  • GLU activation: after the dropout, we pass the output through the GLU function, which halves the channel dimension. This is why we set out_channels to 2 * hid_dim. For the residual connection, we add the convolution's input to its output. However, summing the two can inflate the scale of the activations and gradients, which may destabilize the network during backpropagation, so we normalize by multiplying with the scaling factor self.scale = sqrt(0.5).
  • This output becomes the input for the next layer, and the loop continues until all layers have been traversed. Since conv_input was permuted, the output is also permuted, so we permute it back to the shape required by the fully connected layer. After applying the fully connected layer hid2emb we get conved, which is returned as the first vector from the encoder's forward method. For the combined vector we take conved plus embedded, scaled by the same factor for the gradient-scale reasons explained earlier. This is the second vector returned from the encoder's forward method. A hedged usage example follows.
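
Continuing the illustrative instantiation above (the values are still assumptions, not from the post), a forward pass returns two tensors of shape [batch size, src len, emb dim]:

# a batch of 3 sentences, each 5 tokens long, with dummy token ids
src = torch.randint(0, 10000, (3, 5)).to(device)

conved, combined = enc(src)

print(conved.shape)    # torch.Size([3, 5, 256]) -> [batch size, src len, emb dim]
print(combined.shape)  # torch.Size([3, 5, 256]) -> [batch size, src len, emb dim]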

Decoder

The decoder's convolutional blocks are similar to those inside the encoder, but with a few changes. As shown in the picture, if we had six source tokens, the encoder produces two context vectors per token (the conved and the combined outputs), so twelve context vectors are passed from the encoder to the decoder. The decoder therefore takes both the conved and the combined inputs from the encoder. The decoder also takes the actual target sentence and tries to predict the whole sentence at once. This differs from the recurrent model because all tokens in the sentence are predicted in parallel: there is no sequential processing and no decoding loop in the decoder. Since the whole output is produced in one go, inference has to be handled differently, and during training the model must be prevented from cheating. For the decoder we only pad (kernel_size - 1) positions at the beginning of the sentence; this padding makes sure each filter position sees only the current and previous tokens. The decoder's target is shifted by one word relative to its input. Since we process the whole target sequence simultaneously, we need a method that not only lets the filters carry each token's information forward to the next stage, but also makes sure the model cannot learn to output the next word by simply copying it from its input, without actually learning how to translate. A hedged sketch of this causal left-padding follows the figure below.
[Figure: the decoder's convolutional blocks]
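
Here is a minimal sketch of the causal left-padding idea (illustrative dimensions; this is not the post's decoder code): the time dimension is padded only on the left by kernel_size - 1 positions, so a convolution at position t can only see tokens at positions <= t, while the sequence length is preserved:

import torch
import torch.nn as nn
import torch.nn.functional as F

batch_size, hid_dim, trg_len, kernel_size = 3, 512, 7, 3

conv_input = torch.randn(batch_size, hid_dim, trg_len)

# pad (kernel_size - 1) positions on the left of the time dimension, none on the right
padded = F.pad(conv_input, (kernel_size - 1, 0))

conv = nn.Conv1d(hid_dim, 2 * hid_dim, kernel_size)
conved = conv(padded)

print(padded.shape)  # torch.Size([3, 512, 9])
print(conved.shape)  # torch.Size([3, 1024, 7]) -> same trg len, doubled channels for the GLU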

Attention

The model uses both the encoder conved and the encoder combined outputs to figure out where exactly it should focus while making each prediction. First, we take the conved output of a word from the decoder and do an element-wise sum with the decoder's input embedding to generate a combined embedding. Next, we calculate the attention between this combined embedding and the encoder conved output, to find how well it matches the encoded source. Then this attention is used to calculate a weighted sum over the encoder combined output, which applies the attention. The result is projected back up to the hidden dimension size, and a residual connection to the initial input of the attention layer is applied. A hedged sketch of these steps is shown below.
Below is the attention output matrix of this model on an example, and it shows that the model performs very well.
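
The following sketch walks through those steps with assumed dimensions and layer names (the post does not include the decoder code, so everything here is illustrative):

import torch
import torch.nn as nn
import torch.nn.functional as F

batch_size, trg_len, src_len, emb_dim, hid_dim = 3, 7, 5, 256, 512

attn_hid2emb = nn.Linear(hid_dim, emb_dim)  # project decoder conved down to emb dim
attn_emb2hid = nn.Linear(emb_dim, hid_dim)  # project attended result back up to hid dim
scale = torch.sqrt(torch.FloatTensor([0.5]))

embedded = torch.randn(batch_size, trg_len, emb_dim)    # decoder input embedding
conved = torch.randn(batch_size, hid_dim, trg_len)      # decoder conved output
encoder_conved = torch.randn(batch_size, src_len, emb_dim)
encoder_combined = torch.randn(batch_size, src_len, emb_dim)

# 1) combine the decoder conved output with its input embedding
conved_emb = attn_hid2emb(conved.permute(0, 2, 1))      # [batch, trg len, emb dim]
combined = (conved_emb + embedded) * scale

# 2) match against the encoder conved output to get attention weights
energy = torch.matmul(combined, encoder_conved.permute(0, 2, 1))  # [batch, trg len, src len]
attention = F.softmax(energy, dim = 2)

# 3) weighted sum over the encoder combined output
attended = torch.matmul(attention, encoder_combined)    # [batch, trg len, emb dim]
attended_hid = attn_emb2hid(attended)                   # [batch, trg len, hid dim]

# 4) residual connection back to the decoder conved input, scaled
attended_combined = (conved + attended_hid.permute(0, 2, 1)) * scale  # [batch, hid dim, trg len]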
[Figure: attention matrix for an example sentence]
