Neural Style Transfer is an optimization technique that takes two images, a content image and a style reference image, and blends them so that the output looks like the content image painted in the style of the style reference image.
It is a deep learning technique that generates artistic images. It extracts the structural features from the content image and the style features from the style image.
Deep convolutional neural networks build up representations of an image layer by layer. As we move deeper into the network, the layers increasingly capture the structural content of the image rather than its exact pixel values. Reconstruction from the lower layers reproduces the image almost exactly. In contrast, reconstruction from the higher layers captures the high-level content, and hence we refer to the feature responses of the higher layers as the content representation.
To extract the style representation, we build a feature space on top of the filter responses in each network layer. It consists of the correlations between the different filter responses over the spatial extent of the feature maps. The filter correlations of different layers capture the texture information of the input image. This creates images that match a given image's style on an increasing scale while discarding information about the global arrangement. This multi-scale representation is called the style representation.
This can be easily understood by the diagram given below.
The architecture above is the model proposed in the paper “A Neural Algorithm of Artistic Style”. Here we will use a pre-trained VGG-19 model for content and style reconstruction. By putting the structural information from the content representation together with the texture information from the style representation, we generate an artistic image. A strong emphasis on style results in images that match the artwork's appearance, effectively giving a texturized version of it, but show hardly any of the photograph’s content. A strong emphasis on content makes the photograph easy to identify, but the painting style is not as well matched. We perform gradient descent on the generated image to find an image whose feature responses match those of the original image.
You can install PyTorch from here.
import torch
import torch.nn as nn
import torchvision
import torchvision.models as models
import torchvision.transforms as transforms
import torch.optim as optim
from torchvision.utils import save_image
from PIL import Image
import matplotlib.pyplot as plt
We will use the VGG-19 model from torchvision.models.
VGG-19 is a convolutional neural network that is 19 layers deep. You can load a pre-trained version of the network trained on more than a million images from the ImageNet database. The pretrained network can classify images into 1000 object categories, such as a keyboard, mouse, pencil, and many animals. As a result, the network has learned rich feature representations for a wide range of images. The network has an image input size of 224-by-224.
- The network takes a fixed-size 224 × 224 RGB image as input, i.e. a matrix of shape (224, 224, 3).
- It uses 3 × 3 kernels with a stride of 1 pixel, so the filters cover every position of the image.
- Spatial padding is used to preserve the spatial resolution of the image.
- Max pooling is performed over 2 × 2 pixel windows with stride 2.
- Each convolution is followed by a rectified linear unit (ReLU), which introduces non-linearity so that the model classifies better while keeping computation cheap.
- Three fully connected layers follow: the first two have 4096 channels each, the third has 1000 channels for the 1000-way ILSVRC classification, and the final layer is a softmax function.
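The size bookkeeping in the list above can be checked with a few lines of arithmetic. This is only an illustrative sketch: `conv_out` and `vgg19_spatial_sizes` are hypothetical helper names, and it assumes the usual output-size formula floor((n + 2p - k) / s) + 1, with padding 1 for the 3 × 3 convolutions and no padding for pooling.

```python
def conv_out(size, kernel, stride, padding):
    """Output side length of a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


def vgg19_spatial_sizes(size=224, blocks=5):
    """Track the feature-map side length after each of the five conv blocks."""
    sizes = []
    for _ in range(blocks):
        # 3x3 convolution, stride 1, padding 1: resolution is preserved,
        # so one application stands in for every convolution of the block
        size = conv_out(size, kernel=3, stride=1, padding=1)
        # 2x2 max pooling, stride 2: resolution is halved
        size = conv_out(size, kernel=2, stride=2, padding=0)
        sizes.append(size)
    return sizes


sizes = vgg19_spatial_sizes()
print(sizes)                          # side length after each pooling stage

# The last block has 512 channels, so this is the input size of the first
# fully connected layer
flat_features = 512 * sizes[-1] ** 2
print(flat_features)
```

Starting from 224, the side length halves at every pooling stage, which is why the deeper layers describe larger regions of the input image.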
Now let's load the model
model = models.vgg19(pretrained=True)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
This checks whether a GPU is available: if it is, the model and images will be placed on the GPU, otherwise on the CPU.
We will define a class that returns the feature representations of intermediate layers, since these layers act as complex feature extractors. We use the layers conv1_1, conv2_1, conv3_1, conv4_1 and conv5_1, whose index values are 0, 5, 10, 19 and 28 respectively, store the activations of these five convolutional layers in a list, and return the list.
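class VGG(nn.Module):
    def __init__(self):
        super(VGG, self).__init__()
        # Indices of the five convolutional layers whose activations we keep
        self.req_features = ['0', '5', '10', '19', '28']
        # Keep only the layers up to and including index 28
        self.model = models.vgg19(pretrained=True).features[:29]

    def forward(self, x):
        features = []
        for layer_num, layer in enumerate(self.model):
            x = layer(x)
            # Store the activation of each requested layer
            if str(layer_num) in self.req_features:
                features.append(x)
        return features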
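The index values 0, 5, 10, 19 and 28 can be derived from the layer layout rather than memorised. The sketch below assumes that the feature extractor stores every convolution as a convolution module followed by a ReLU module and every pooling stage as a single module (the non-batch-norm VGG-19 layout); `VGG19_CFG` and `first_conv_indices` are hypothetical names introduced for illustration.

```python
# Channel layout of VGG-19's convolutional part; 'M' marks a max-pooling layer.
VGG19_CFG = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 256, 'M',
             512, 512, 512, 512, 'M', 512, 512, 512, 512, 'M']


def first_conv_indices(cfg):
    """Return the position of the first convolution of each block and the
    total number of modules, counting two modules per convolution
    (convolution + ReLU) and one per pooling layer."""
    indices, position, start_of_block = [], 0, True
    for entry in cfg:
        if entry == 'M':
            position += 1              # one pooling module
            start_of_block = True
        else:
            if start_of_block:
                indices.append(position)
                start_of_block = False
            position += 2              # convolution + ReLU
    return indices, position


indices, total_modules = first_conv_indices(VGG19_CFG)
print(indices)          # positions of the first convolution in each block
print(total_modules)    # number of modules in the full feature extractor
```

Under this assumption, slicing the feature extractor with `[:29]` keeps modules 0 to 28, so the last module that runs is the fifth requested convolution.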
Preprocessing is required to make an image suitable for the model.
We will perform preprocessing with torchvision.transforms, resizing the image and converting it into a tensor.
We define a function that takes the path of an image as its argument and returns the preprocessed image.
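def image_loader(path):
    # Convert to RGB so grayscale or RGBA files also yield 3 channels
    image = Image.open(path).convert('RGB')
    # Resize to 512 x 512 and convert to a tensor with values in [0, 1]
    loader = transforms.Compose([transforms.Resize((512, 512)), transforms.ToTensor()])
    # Add the batch dimension expected by the model
    image = loader(image).unsqueeze(0)
    return image.to(device, torch.float)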
unsqueeze(0) adds an extra dimension at index 0 for the batch size.
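A quick way to see the effect of the batch dimension is to compare shapes before and after. In this sketch a random tensor stands in for the output of the transform pipeline, which is an assumption made so the snippet runs without an image file.

```python
import torch

# Stand-in for a transformed image: channels x height x width
image = torch.rand(3, 512, 512)

# Add a batch dimension at index 0
batched = image.unsqueeze(0)

print(image.shape)     # three dimensions: channels, height, width
print(batched.shape)   # four dimensions: batch, channels, height, width
```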
Now use the image_loader function to load the style and content images from the local disk. We use a clone of the content image as the starting point for the generated image. Since gradient descent will alter the pixel values of the generated image, we set requires_grad to True on it.
original_image = image_loader('/content/mountain.jpg')
style_image = image_loader('/content/style.jpg')
generated_image = original_image.clone().requires_grad_(True)
Here we will describe two loss functions: the content loss and the style loss.
The content loss function ensures that the activations of the higher layers are similar between the content image and the generated image. The style loss function ensures that the filter correlations in all the selected layers are similar between the style image and the generated image.
Content Loss Function
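The content image (original_image in the code) and the generated_image are passed through the model, and the outputs of the intermediate layers are extracted using the VGG class defined above. We then measure the squared Euclidean distance between the features of the generated_image and those of the content image. The function below uses the mean of the squared differences as the content loss for one layer: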
def calc_content_loss(gen_feat, orig_feat):
    # Mean squared difference between generated and content features
    content_l = torch.mean((gen_feat - orig_feat) ** 2)
    return content_l
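A tiny worked example makes the content loss concrete. The feature maps below are toy values chosen by hand, not real VGG activations.

```python
import torch


def calc_content_loss(gen_feat, orig_feat):
    # Mean squared difference over every element of the feature map
    return torch.mean((gen_feat - orig_feat) ** 2)


# Toy feature maps of shape (batch, channel, height, width) = (1, 1, 2, 2)
orig_feat = torch.zeros(1, 1, 2, 2)
gen_feat = torch.tensor([[[[1.0, 2.0], [3.0, 4.0]]]])

loss = calc_content_loss(gen_feat, orig_feat)
same = calc_content_loss(orig_feat, orig_feat)

print(loss.item())   # (1 + 4 + 9 + 16) / 4 = 7.5
print(same.item())   # identical feature maps give 0.0
```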
Style Loss Function
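To calculate the style loss we need to compute the Gram matrix. A Gram matrix is the product of a matrix with its own transpose. Here the matrix is the feature map flattened to one row per channel, so each entry of the Gram matrix measures the correlation between two filter responses.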
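The Gram matrix can be computed by hand for a very small feature map. The values below are toy numbers picked so the result is easy to verify; the variable names are illustrative only.

```python
import torch

# Toy feature map: 1 image, 2 channels, 2 x 2 spatial positions
features = torch.tensor([[[[1.0, 2.0], [3.0, 4.0]],
                          [[1.0, 0.0], [1.0, 0.0]]]])
batch, channels, height, width = features.shape

# Flatten each channel into a row: shape (channels, height * width)
flat = features.view(channels, height * width)

# Gram matrix: entry (i, j) is the dot product of channel i with channel j
G = torch.mm(flat, flat.t())
print(G)   # entries: 30 and 2 on the diagonal, 4 off the diagonal
```

The result has one row and one column per channel regardless of the image size, which is why it keeps texture statistics but discards the spatial arrangement.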
The style loss of layer $l$ is the squared error between the Gram matrices of the intermediate representations of the generated_image and the style image:

$$E_l = \frac{1}{4 N_l^2 M_l^2} \sum_{i,j} \left( G^l_{ij} - A^l_{ij} \right)^2$$

where $E_l$ is the style loss for layer $l$, $N_l$ is the number of channels, and $M_l$ is the height times the width of the feature representation of layer $l$. $G^l_{ij}$ and $A^l_{ij}$ are the Gram matrix entries of the intermediate representations of the generated_image and the style image respectively.
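The overall style loss is therefore

$$L_{style} = \sum_{l} w_l E_l$$

Here $w_l$ is the weight of layer $l$'s contribution to the total style loss. The code below sums the layers without explicit weights, which treats them all equally.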
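def calc_style_loss(gen, style):
    # Assumes a batch size of 1, as produced by image_loader
    batch_size, channel, height, width = gen.shape
    # Gram matrices of the generated and style feature maps
    G = torch.mm(gen.view(channel, height * width), gen.view(channel, height * width).t())
    A = torch.mm(style.view(channel, height * width), style.view(channel, height * width).t())
    # Mean squared difference between the two Gram matrices
    style_l = torch.mean((G - A) ** 2)
    return style_l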
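Note that the function above averages the squared differences, whereas the paper's per-layer style loss scales the sum by 1 / (4 N² M²). The sketch below, on random toy features, suggests that the two differ only by a constant factor for a given layer; `gram` is a hypothetical helper and the tensor sizes are arbitrary.

```python
import torch


def gram(feat):
    # Flatten to (channels, height * width) and multiply by the transpose
    _, channels, height, width = feat.shape
    flat = feat.view(channels, height * width)
    return torch.mm(flat, flat.t())


torch.manual_seed(0)
N, M = 4, 8 * 8                      # N channels, M spatial positions
gen = torch.rand(1, N, 8, 8)
style = torch.rand(1, N, 8, 8)
G, A = gram(gen), gram(style)

# Averaged version, as in calc_style_loss: sum divided by N * N entries
mean_version = torch.mean((G - A) ** 2)
# Paper-style normalisation: sum divided by 4 * N^2 * M^2
paper_version = torch.sum((G - A) ** 2) / (4 * N ** 2 * M ** 2)

# The two agree up to the constant factor 4 * M^2
print(torch.allclose(mean_version, paper_version * 4 * M ** 2))
```

Because the factor depends on the spatial size M, the averaged version implicitly weights layers differently from the paper's formula, which may be worth keeping in mind when tuning beta.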
def calculate_loss(gen_features, orig_features, style_features):
    style_loss = content_loss = 0
    # Accumulate both losses over the five selected layers
    for gen, con, style in zip(gen_features, orig_features, style_features):
        content_loss += calc_content_loss(gen, con)
        style_loss += calc_style_loss(gen, style)
    # alpha and beta are the global content and style weights set before training
    total_loss = alpha * content_loss + beta * style_loss
    return total_loss
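To get a feel for how alpha and beta trade content against style, it can help to plug in numbers. The loss values below are made up purely for illustration and `combine_losses` is a hypothetical helper, not part of the code above.

```python
def combine_losses(content_loss, style_loss, alpha, beta):
    """Weighted sum used as the optimisation objective."""
    return alpha * content_loss + beta * style_loss


# Made-up loss values, chosen only to illustrate the weighting
content_loss, style_loss = 0.5, 0.02

balanced = combine_losses(content_loss, style_loss, alpha=8, beta=70)
style_heavy = combine_losses(content_loss, style_loss, alpha=8, beta=7000)

print(balanced)      # 8 * 0.5 + 70 * 0.02
print(style_heavy)   # 8 * 0.5 + 7000 * 0.02

# Fraction of the total contributed by the style term in each setting
style_share_balanced = 70 * style_loss / balanced
style_share_heavy = 7000 * style_loss / style_heavy
print(style_share_balanced, style_share_heavy)
```

With the larger beta almost all of the gradient signal comes from the style term, which matches the earlier observation that a strong emphasis on style yields a texturized image with little of the photograph's content.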
Before training, we should set our hyperparameters and optimizer.
I have chosen the Adam optimizer, but you can also try L-BFGS, a limited-memory quasi-Newton method that approximates the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm using a limited amount of computer memory, via optim.LBFGS.
model = VGG().to(device).eval()
epoch = 6000
lr = 0.004
alpha = 8    # weight of the content loss
beta = 70    # weight of the style loss
optimizer = optim.Adam([generated_image], lr=lr)
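If you want to experiment with L-BFGS instead of Adam, keep in mind that PyTorch's L-BFGS optimizer expects a closure that re-evaluates the loss. The following is a minimal sketch on a toy problem, assuming a plain mean-squared-error objective in place of the style transfer loss; the tensor names are illustrative.

```python
import torch
import torch.optim as optim

torch.manual_seed(0)
target = torch.rand(1, 3, 4, 4)                        # stand-in for the desired result
image = torch.zeros(1, 3, 4, 4, requires_grad=True)    # tensor being optimised

optimizer = optim.LBFGS([image], lr=1)


def closure():
    # L-BFGS may evaluate the loss several times per step, so the whole
    # forward and backward pass lives inside a closure that returns the loss
    optimizer.zero_grad()
    loss = torch.mean((image - target) ** 2)
    loss.backward()
    return loss


for _ in range(5):
    optimizer.step(closure)

final_loss = torch.mean((image - target) ** 2).item()
print(final_loss)   # should be very close to zero on this toy problem
```

In the style transfer setting the closure would compute the total loss from the model's features instead of the toy objective used here.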
In the for loop we iterate over the number of epochs. In each iteration we extract the feature representations of the intermediate layers for the content, style and generated images using the model, then compute the loss with the calculate_loss() function defined above. We set the gradients to zero using optimizer.zero_grad(), backpropagate the loss using total_loss.backward(), and update the pixel values of the generated image (gradient descent) using optimizer.step(). Every 100 iterations the loss is printed and the current image is saved to gen.png.
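for i in range(epoch):
    # Feature representations of the generated, content and style images
    gen_features = model(generated_image)
    orig_features = model(original_image)
    style_features = model(style_image)

    total_loss = calculate_loss(gen_features, orig_features, style_features)

    # Gradient step on the pixels of the generated image
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()

    # Report progress and save the current result every 100 iterations
    if i % 100 == 0:
        print(total_loss)
        save_image(generated_image, 'gen.png')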
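The same loop structure can be tried on a toy problem that runs in a fraction of a second. In the sketch below a tiny frozen network stands in for VGG-19 and only a content-style feature-matching loss is used; both are assumptions made to keep the example small, and the names are illustrative.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Tiny frozen network standing in for VGG-19 (an assumption made for speed)
extractor = nn.Sequential(nn.Conv2d(3, 4, kernel_size=3, padding=1), nn.Tanh()).eval()
for p in extractor.parameters():
    p.requires_grad_(False)       # only the image is optimised, never the weights

target_image = torch.rand(1, 3, 8, 8)
generated = torch.rand(1, 3, 8, 8, requires_grad=True)
optimizer = optim.Adam([generated], lr=0.05)

with torch.no_grad():
    target_features = extractor(target_image)    # fixed, so computed once

losses = []
for step in range(100):
    loss = torch.mean((extractor(generated) - target_features) ** 2)
    optimizer.zero_grad()
    loss.backward()               # gradients flow back to the pixels
    optimizer.step()              # Adam updates the pixel values
    losses.append(loss.item())

print(losses[0], losses[-1])      # the loss should shrink as the image is updated
```

The sketch also computes the target features once outside the loop; since the content and style images never change, the same idea could be applied to the full training loop to save computation.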
So in this blog, we learned how Neural Style Transfer works. We loaded the pre-trained VGG-19 model, preprocessed the images, defined the content and style loss functions, combined them into the total loss, and finally ran the optimization to get the artistic image as output.
GitHub Link - Here