James Casia

Creating Custom Models with Pytorch

PyTorch was built with custom models in mind. With just a few lines of code, you can spin up and train a deep learning model in a couple of minutes. This is a quick guide to creating typical deep learning models in PyTorch.

Simple Custom Models

To build a custom model, just inherit from nn.Module and define the forward function.

import torch
import torch.nn as nn

class FCN(nn.Module):
    def __init__(self, input_dims, output_dims):
        super(FCN, self).__init__()

        self.model = nn.Sequential(
            nn.Linear(input_dims, 5),
            nn.LeakyReLU(),
            nn.Linear(5, output_dims),
            nn.Sigmoid()
        )

    def forward(self, X):
        return self.model(X)

And when you instantiate it,

fcn = FCN(10, 1)
fcn

FCN(
(model): Sequential(
(0): Linear(in_features=10, out_features=5, bias=True)
(1): LeakyReLU(negative_slope=0.01)
(2): Linear(in_features=5, out_features=1, bias=True)
(3): Sigmoid()
)
)

Simple as that, you now have a two-layer neural network! Take note of this pattern, as it repeats in more complex models.
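As a quick sanity check (a minimal sketch; the batch size is arbitrary), we can push a random batch through the network:

x = torch.randn(32, 10)   # a batch of 32 samples with 10 features each
out = fcn(x)
print(out.shape)          # torch.Size([32, 1]), squashed to (0, 1) by the Sigmoid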

A Slightly More Complicated Model

Sometimes models have repeating groups of layers, called blocks. These are common in convolutional neural networks, where a block usually contains convolutional layers followed by max-pooling layers. We can create these blocks by writing our own helper functions. This time, we'll create a linear block with batch normalization and ReLU. (Don't worry, we'll get to CNNs later in this guide.)

Simply create a function and return a Sequential model that wraps the layers.

def net_block(input_dim, output_dim):
    return nn.Sequential(
        nn.Linear(input_dim, output_dim),
        nn.BatchNorm1d(output_dim),
        nn.ReLU()
    )

Let’s add a few of these blocks to our model!

class Network(nn.Module):
    def __init__(self, input_dim, hidden_dim_1, hidden_dim_2, output_dim):
        super(Network, self).__init__()

        self.model = nn.Sequential(
            net_block(input_dim, hidden_dim_1),
            net_block(hidden_dim_1, hidden_dim_2),
            net_block(hidden_dim_2, output_dim)
        )

    def forward(self, X):
        return self.model(X)

net = Network(10, 4, 5, 1)
net

Network(
(model): Sequential(
(0): Sequential(
(0): Linear(in_features=10, out_features=4, bias=True)
(1): BatchNorm1d(4, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU()
)
(1): Sequential(
(0): Linear(in_features=4, out_features=5, bias=True)
(1): BatchNorm1d(5, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU()
)
(2): Sequential(
(0): Linear(in_features=5, out_features=1, bias=True)
(1): BatchNorm1d(1, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(2): ReLU()
)
)
)

We can see the repeating blocks in the printed model. Phew! This saved us from defining nine individual layers one by one.
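As before, a quick forward pass confirms the shapes line up (a minimal sketch; the batch size is arbitrary):

x = torch.randn(32, 10)
print(net(x).shape)   # torch.Size([32, 1])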

VGG Network

VGG networks were among the first very deep CNNs. Let's try to recreate one. First, we define the blocks. VGG has two types of blocks, defined in the original paper.


Figure 1. The two types of VGG Blocks: the two-layer(blue, orange) and the three-layer ones(purple, green, red). Image taken from this datasciencecentral article.

Let’s define the two blocks.

def vgg_block(in_channels, out_channels):
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=(3, 3), stride=1, padding=(1, 1)),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=(2, 2), stride=2)
    )

def vgg_block2(in_channels, out_channels):
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=(3, 3), stride=1, padding=(1, 1)),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_channels, out_channels, kernel_size=(3, 3), stride=1, padding=(1, 1)),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=(2, 2), stride=2)
    )

Let’s also define a flatten layer. A flatten layer reshapes each sample in the batch into a one-dimensional vector. (Recent PyTorch versions ship nn.Flatten, but it is easy enough to write ourselves.)

class Flatten(nn.Module):
    def forward(self, input):
        # keep the batch dimension, collapse everything else into one dimension
        return input.view(input.size(0), -1)
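For instance, a (batch, 512, 7, 7) feature map flattens to (batch, 25088), which is exactly the input size the classifier below expects:

f = Flatten()
t = torch.randn(2, 512, 7, 7)
print(f(t).shape)   # torch.Size([2, 25088]), since 512 * 7 * 7 = 25088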

Let’s define the classification block.

def classifier_block(input_dim, hidden_dim, num_classes):
    return nn.Sequential(
        Flatten(),
        nn.Linear(input_dim, hidden_dim),
        nn.ReLU(),
        nn.Dropout(0.5),
        nn.Linear(hidden_dim, hidden_dim),
        nn.ReLU(),
        nn.Dropout(0.5),
        nn.Linear(hidden_dim, num_classes),
        # in practice, pass dim=1 to Softmax, or drop it and train with nn.CrossEntropyLoss
        nn.Softmax()
    )

Now let’s define the model.

class VGG(nn.Module):
    def __init__(self):
        super(VGG, self).__init__()
        self.model = nn.Sequential(
            vgg_block(3, 64),
            vgg_block(64, 128),
            vgg_block2(128, 256),
            vgg_block2(256, 512),
            vgg_block2(512, 512),
            nn.AdaptiveAvgPool2d(output_size=(7, 7)),
            classifier_block(512*7*7, 4096, 1000)
        )

    def forward(self, X):
        return self.model(X)

vgg = VGG()
vgg

VGG(
(model): Sequential(
(0): Sequential(
(0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): ReLU(inplace=True)
(2): MaxPool2d(kernel_size=(2, 2), stride=2, padding=0, dilation=1, ceil_mode=False)
)
(1): Sequential(
(0): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): ReLU(inplace=True)
(2): MaxPool2d(kernel_size=(2, 2), stride=2, padding=0, dilation=1, ceil_mode=False)
)
(2): Sequential(
(0): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): ReLU(inplace=True)
(2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(3): ReLU(inplace=True)
(4): MaxPool2d(kernel_size=(2, 2), stride=2, padding=0, dilation=1, ceil_mode=False)
)
(3): Sequential(
(0): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): ReLU(inplace=True)
(2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(3): ReLU(inplace=True)
(4): MaxPool2d(kernel_size=(2, 2), stride=2, padding=0, dilation=1, ceil_mode=False)
)
(4): Sequential(
(0): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): ReLU(inplace=True)
(2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(3): ReLU(inplace=True)
(4): MaxPool2d(kernel_size=(2, 2), stride=2, padding=0, dilation=1, ceil_mode=False)
)
(5): AdaptiveAvgPool2d(output_size=(7, 7))
(6): Sequential(
(0): Flatten()
(1): Linear(in_features=25088, out_features=4096, bias=True)
(2): ReLU()
(3): Dropout(p=0.5, inplace=False)
(4): Linear(in_features=4096, out_features=4096, bias=True)
(5): ReLU()
(6): Dropout(p=0.5, inplace=False)
(7): Linear(in_features=4096, out_features=1000, bias=True)
(8): Softmax(dim=None)
)
)
)

We have now recreated a VGG-style model! How cool is that?
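As a quick check (a minimal sketch; the 224x224 input size is the one VGG was designed around), a single random image passes straight through to 1000 class scores:

x = torch.randn(1, 3, 224, 224)   # one RGB image at 224x224
print(vgg(x).shape)               # torch.Size([1, 1000])

Depending on your PyTorch version, you may see a warning about Softmax's implicit dim; passing dim=1 silences it.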

A Complex Model

What we’ve encountered so far are relatively simple networks: each layer takes the output of the previous layer and transforms it. An improvement on this is the residual network, which passes the output of an earlier layer to a layer a few steps ahead, forming a “skip connection”. This has been shown to help deep neural networks learn better, as gradients can flow more readily through these connections. A few examples of residual architectures are ResNet and U-Net.

Residual Networks (ResNets)


Figure 2. ResNet. Image taken from deeplearning.ai Convolutional neural networks course

Residual networks are structured a bit differently to accommodate this. In these networks, we need references to the layers involved in the skip connections (as you will see later on).

This time, helper functions that return a Sequential won't cut it, because the forward function itself needs to be customized to accommodate the skip connection.

The Residual Block

class ResidualBlock(nn.Module):
    def __init__(self, input_dims, output_dims):
        super(ResidualBlock, self).__init__()
        # Define the layers of the block
        self.conv1 = nn.Sequential(
            nn.Conv2d(input_dims, output_dims, kernel_size=(3, 3), padding=(1, 1)),
            nn.ReLU()
        )
        self.conv2 = nn.Conv2d(output_dims, output_dims, kernel_size=(1, 1))
        self.act = nn.ReLU()

    def forward(self, X):
        # Keep a copy of X; it will be added back in later (the skip connection)
        X2 = X.clone()
        X = self.conv1(X)
        X = self.conv2(X)
        # Add the saved input X2 to the transformed X before the final activation
        X = self.act(X + X2)
        return X

In the code above, we keep separate attributes for the different layers and customize the forward function a bit. Since we need to pass the initial input X to a later point in the block, we store a copy of it in X2. We then transform X by passing it through the layers, and finally add X2 back in before the final activation.

  • Question: How come this is possible? Don’t convolutions downsample the image, making X + X2 impossible because the shapes differ?

    Great question! I actually asked myself this while reviewing ResNets. It turns out the reason this works is simply that the two tensors have exactly the same shape. The first convolution in the block has a 3x3 kernel with 1x1 padding and stride 1, so its output shape equals its input shape. Recall that the output size along a dimension is $\lfloor\frac{\text{dim}+2p-k}{s}\rfloor + 1$, where dim is the input size, p the padding, k the kernel size, and s the stride: with p = 1, k = 3, s = 1 we get $\text{dim} + 2 - 3 + 1 = \text{dim}$. The 1x1 convolution that follows (p = 0, k = 1, s = 1) preserves the shape as well, and since the block keeps the number of channels the same, X and X2 line up perfectly.
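We can sanity-check this with a quick shape test (a minimal sketch; the batch and spatial sizes are arbitrary):

block = ResidualBlock(64, 64)
x = torch.randn(8, 64, 56, 56)
print(block(x).shape)   # torch.Size([8, 64, 56, 56]), same shape as the input

With the block verified, we can stack several of them into a ResNet-style network.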

class ResNet(nn.Module):
    def __init__(self):
        super(ResNet, self).__init__()
        self.model = nn.Sequential(
            nn.Conv2d(3, 64, stride=(2, 2), kernel_size=(7, 7)),
            nn.MaxPool2d(stride=(2, 2), kernel_size=(7, 7)),
            ResidualBlock(64, 64),
            ResidualBlock(64, 64),
            ResidualBlock(64, 64),
            ResidualBlock(64, 64),
            ResidualBlock(64, 64),
            ResidualBlock(64, 64),
            nn.AdaptiveAvgPool2d(output_size=(7, 7)),
            classifier_block(64*7*7, 4096, 1000)
        )

    def forward(self, X):
        return self.model(X)
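Instantiating it prints the full architecture, following the same pattern as before:

resnet = ResNet()
resnet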

ResNet(
(model): Sequential(
(0): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2))
(1): MaxPool2d(kernel_size=(7, 7), stride=(2, 2), padding=0, dilation=1, ceil_mode=False)
(2): ResidualBlock(
(conv1): Sequential(
(0): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): ReLU()
)
(conv2): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1))
(act): ReLU()
)
(3): ResidualBlock(
(conv1): Sequential(
(0): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): ReLU()
)
(conv2): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1))
(act): ReLU()
)
(4): ResidualBlock(
(conv1): Sequential(
(0): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): ReLU()
)
(conv2): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1))
(act): ReLU()
)
(5): ResidualBlock(
(conv1): Sequential(
(0): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): ReLU()
)
(conv2): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1))
(act): ReLU()
)
(6): ResidualBlock(
(conv1): Sequential(
(0): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): ReLU()
)
(conv2): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1))
(act): ReLU()
)
(7): ResidualBlock(
(conv1): Sequential(
(0): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): ReLU()
)
(conv2): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1))
(act): ReLU()
)
(8): AdaptiveAvgPool2d(output_size=(7, 7))
(9): Sequential(
(0): Flatten()
(1): Linear(in_features=3136, out_features=4096, bias=True)
(2): ReLU()
(3): Dropout(p=0.5, inplace=False)
(4): Linear(in_features=4096, out_features=4096, bias=True)
(5): ReLU()
(6): Dropout(p=0.5, inplace=False)
(7): Linear(in_features=4096, out_features=1000, bias=True)
(8): Softmax(dim=None)
)
)
)

A Slightly More Complex Model

Recurrent neural networks (RNNs) are neural networks that perform better on sequential data. To understand RNNs, it is essential to first understand the RNN cell.

The RNN Cell

The RNN Cell is simply a compact linear network that accepts two inputs. It is governed by the following formulae.

$a^{\langle t \rangle} = f(W_{aa}a^{\langle t-1 \rangle} + W_{ax}x^{\langle t \rangle} + b_a)$

$\hat y^{\langle t \rangle} = g(W_{ya}a^{\langle t \rangle} + b_y)$


Figure 3. An RNN Cell. Image retrieved from deeplearning.ai Sequence Models course.

It takes in two inputs: the activation from the previous cell, $a^{\langle t-1 \rangle}$, and the input for the current time step, $x^{\langle t \rangle}$ (in a stacked RNN, this is the output of the previous layer). These are concatenated and passed through a linear model parameterized by $W_a$ and $b_a$ to get the current cell's activation; this is equivalent to the formula above, with $W_a = [W_{aa}\ W_{ax}]$ acting on the concatenated vector. The activation is passed on to the next cell and is also transformed by a second linear layer parameterized by $W_y$ and $b_y$ (optional) to form $\hat y^{\langle t \rangle}$.

class RNNCell(nn.Module):
    def __init__(self, embed_length, act_dim, output_dim, **kwargs):
        super(RNNCell, self).__init__()

        act = kwargs.get("act", "relu")
        acty = kwargs.get("acty", "relu")
        activation_map = {"relu": nn.ReLU(), "l-relu": nn.LeakyReLU(), "sig": nn.Sigmoid(), "tanh": nn.Tanh()}
        self.act_dim = act_dim

        self.linear = nn.Linear(act_dim + embed_length, act_dim)
        self.activation = activation_map.get(act)
        self.linear_y = nn.Linear(act_dim, output_dim)
        self.activation_y = activation_map.get(acty)

    def forward(self, a, X):
        # X is n x embed_length
        # a is n x act_dim
        assert a.shape[1] == self.act_dim
        # Concatenate a and X since they will be transformed by the same Linear layer
        X = torch.cat((a, X), dim=1)
        # Transform the inputs
        X = self.linear(X)
        a = self.activation(X)
        X = self.linear_y(a)
        Y = self.activation_y(X)
        return a, Y
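Here is a minimal sketch of a single cell in action (the sizes are arbitrary, chosen to match the stacked model later: an embedding length of 12 and an activation size of 10):

cell = RNNCell(12, 10, 12)
a0 = torch.zeros(4, 10)    # initial activation for a batch of 4
x1 = torch.randn(4, 12)    # one time step of input
a1, y1 = cell(a0, x1)
print(a1.shape, y1.shape)  # torch.Size([4, 10]) torch.Size([4, 12])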

An RNN Layer

An RNN layer is just a series of RNN cells, with each cell's activation fed to the next one. Its forward function is more involved, since the previous cell's activation must be passed along to the next cell.


Figure 4. An RNN Layer. Image retrieved from colah’s blog.


The forward function computes each cell's activation and output in turn, working from left to right, since each cell needs the previous cell's activation.

class RNN(nn.Module):
    def __init__(self, **kwargs):
        super(RNN, self).__init__()
        # output shape is 3d: n x output_dim x unit_output_dim
        # input_dim is the embedding length (i.e. the word-embedding length)
        self.input_dim = kwargs.get("input_dim", 0)
        self.act_dim = kwargs.get("act_dim", self.input_dim)
        self.output_dim = kwargs.get("output_dim", 1)
        self.time_steps = kwargs.get("time_steps", self.output_dim)
        self.unit_output_dim = kwargs.get("unit_output_dim", 1)

        assert self.output_dim <= self.time_steps
        # Populate the layer with cells, one per time step.
        # Note: [cell] * time_steps repeats the same RNNCell object, so the weights
        # are shared across time steps, which is the standard RNN behaviour.
        self.models = nn.ModuleList([
            RNNCell(self.input_dim, self.act_dim, self.unit_output_dim)
        ] * self.time_steps)

    def forward(self, X):
        # X is n x time_steps x embed_length
        n = X.shape[0]

        # X's time dimension must not exceed time_steps
        assert X.shape[1] <= self.time_steps

        # If the input has fewer time steps than the layer expects,
        # pad it with zero tensors so the sizes match.
        X = torch.cat((X, torch.zeros(n, self.time_steps - X.shape[1], self.input_dim)), dim=1)

        # Initialize the first activation.
        a = torch.zeros(n, self.act_dim)

        # Create the Y array; the individual y-predictions will be stored here.
        Y = torch.zeros(n, self.output_dim, self.unit_output_dim)

        # y_i tracks which output slot we are filling.
        y_i = 0
        for i, cell in enumerate(self.models):
            # Get the input for the current time step
            x = X[:, i, :]
            # Forward the input and the previous activation to the current cell.
            a, y = cell(a, x)
            # Only keep the last output_dim predictions.
            if i >= self.time_steps - self.output_dim:
                Y[:, y_i, :] = y
                y_i += 1

        return Y
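A quick sketch of a single layer in use (arbitrary sizes; note that an input with fewer time steps than the layer expects gets zero-padded):

rnn = RNN(input_dim=12, act_dim=10, time_steps=5, output_dim=4, unit_output_dim=12)
x = torch.randn(8, 3, 12)   # batch of 8 sequences, only 3 time steps provided
print(rnn(x).shape)         # torch.Size([8, 4, 12])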

A Basic RNN

That was hectic! Now let’s put it all together. A Sequential object will do for now to keep things simple. Let’s stack these RNN layers together.

embed_length = 12   # word-embedding length (matches the printed model below)

model = nn.Sequential(
    RNN(input_dim=embed_length, act_dim=10, time_steps=5, output_dim=4, unit_output_dim=embed_length),
    RNN(input_dim=embed_length, act_dim=7, time_steps=4, output_dim=2, unit_output_dim=embed_length),
    RNN(input_dim=embed_length, act_dim=3, time_steps=2, output_dim=1, unit_output_dim=1)
)

Sequential(
(0): RNN(
(models): ModuleList(
(0): RNNCell(
(linear): Linear(in_features=22, out_features=10, bias=True)
(activation): ReLU()
(linear_y): Linear(in_features=10, out_features=12, bias=True)
(activation_y): ReLU()
)
(1): RNNCell(
(linear): Linear(in_features=22, out_features=10, bias=True)
(activation): ReLU()
(linear_y): Linear(in_features=10, out_features=12, bias=True)
(activation_y): ReLU()
)
(2): RNNCell(
(linear): Linear(in_features=22, out_features=10, bias=True)
(activation): ReLU()
(linear_y): Linear(in_features=10, out_features=12, bias=True)
(activation_y): ReLU()
)
(3): RNNCell(
(linear): Linear(in_features=22, out_features=10, bias=True)
(activation): ReLU()
(linear_y): Linear(in_features=10, out_features=12, bias=True)
(activation_y): ReLU()
)
(4): RNNCell(
(linear): Linear(in_features=22, out_features=10, bias=True)
(activation): ReLU()
(linear_y): Linear(in_features=10, out_features=12, bias=True)
(activation_y): ReLU()
)
)
)
(1): RNN(
(models): ModuleList(
(0): RNNCell(
(linear): Linear(in_features=19, out_features=7, bias=True)
(activation): ReLU()
(linear_y): Linear(in_features=7, out_features=12, bias=True)
(activation_y): ReLU()
)
(1): RNNCell(
(linear): Linear(in_features=19, out_features=7, bias=True)
(activation): ReLU()
(linear_y): Linear(in_features=7, out_features=12, bias=True)
(activation_y): ReLU()
)
(2): RNNCell(
(linear): Linear(in_features=19, out_features=7, bias=True)
(activation): ReLU()
(linear_y): Linear(in_features=7, out_features=12, bias=True)
(activation_y): ReLU()
)
(3): RNNCell(
(linear): Linear(in_features=19, out_features=7, bias=True)
(activation): ReLU()
(linear_y): Linear(in_features=7, out_features=12, bias=True)
(activation_y): ReLU()
)
)
)
(2): RNN(
(models): ModuleList(
(0): RNNCell(
(linear): Linear(in_features=15, out_features=3, bias=True)
(activation): ReLU()
(linear_y): Linear(in_features=3, out_features=1, bias=True)
(activation_y): ReLU()
)
(1): RNNCell(
(linear): Linear(in_features=15, out_features=3, bias=True)
(activation): ReLU()
(linear_y): Linear(in_features=3, out_features=1, bias=True)
(activation_y): ReLU()
)
)
)
)
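And a minimal end-to-end sketch: a batch of sequences goes in, and a single prediction per sequence comes out (the batch size is arbitrary):

x = torch.randn(16, 5, embed_length)   # 16 sequences, 5 time steps each
print(model(x).shape)                  # torch.Size([16, 1, 1])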

That's it for now! We could go on with all the different types of RNNs, but this post is already getting long. I may continue with more complicated models if this post gains enough traction, so make sure to share it with your friends!
'til next time!

This article is also published on Medium.
