ZFNet was introduced in the paper Visualizing and Understanding Convolutional Networks by Matthew D. Zeiler and Rob Fergus. The architecture did not itself win the competition, but its insights were used by that year's winner (Clarifai, founded by Zeiler, 11.19% test error). The paper is remarkable for its visualizations and its analysis of the internal operation and behavior of a CNN model classifying an image. It also introduced a technique now widely known as Transfer Learning.
Following AlexNet's win in 2012, there was an enormous increase in CNN model submissions for ILSVRC 2013, but most of them were trial-and-error based, without exhibiting any understanding of how and why CNNs performed so well.
Let's understand that (as explained by the authors). A CNN model

- maps a color 2D input image `x_i`, via a series of layers, to a probability vector `ŷ_i` over the `C` different classes, where each layer consists of:
  1. convolution of the previous layer's output with a set of learned filters, passing the responses through a rectified linear function,
  2. optionally, max pooling over local neighborhoods, and
  3. optionally, a local contrast operation that normalizes the responses across feature maps (no longer in common use);
- has a few conventional fully connected top layers, with the final layer being a softmax classifier;
- is trained using a large set of `N` labelled images `{x, y}`, where each label `y_i` is a discrete variable indicating the true class;
- uses the cross-entropy loss `-Σ p(x) log q(x)`, suitable for image classification, to compare `ŷ` and `y`;
- has its parameters trained by backpropagating the derivative of the loss with respect to the parameters throughout the network, and updating them via stochastic gradient descent in mini-batches.
Updating AlexNet
Understanding the operation of a CNN requires interpreting the feature activity in intermediate layers, so the authors present a novel technique known as a DeconvNet (initially proposed by Zeiler et al. as an unsupervised learning method) to map these activities back to the input pixel space, showing what input pattern originally caused a given activation in the feature maps.
A DeconvNet is attached to each ConvNet layer, providing a continuous path back to image pixels. To examine a given ConvNet activation, all other activations in the layer are set to zero and the feature maps are passed as input to the attached DeconvNet layer. The signal is then successively
1. unpooled (using the switches that record the location of each local max in max pooling),
2. rectified, and
3. filtered (using transposed versions of the same filters as in the ConvNet)
to reconstruct the activity in the layer beneath that gave rise to the chosen activation. This is repeated until input pixel space is reached.
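Here is a hedged, minimal PyTorch sketch of one ConvNet layer and its attached DeconvNet step; the tensor shapes and random filters are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 32, 32)          # toy input image
filters = torch.randn(16, 3, 3, 3)     # a set of (learned) filters

# --- ConvNet path: convolve, rectify, max pool (recording the switches) ---
conv_out = F.relu(F.conv2d(x, filters, padding=1))
pooled, switches = F.max_pool2d(conv_out, kernel_size=2, return_indices=True)

# To examine a single activation, the paper zeroes all other activations in
# the layer first; here the whole map is passed back for brevity.

# --- DeconvNet path: unpool, rectify, filter with the transposed filters ---
unpooled = F.max_unpool2d(pooled, switches, kernel_size=2)   # 1. unpool via switches
rectified = F.relu(unpooled)                                 # 2. rectify
reconstruction = F.conv_transpose2d(rectified, filters, padding=1)  # 3. transposed filtering
print(reconstruction.shape)  # torch.Size([1, 3, 32, 32]) -- back in pixel space
```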
They trained AlexNet, reproducing its test error to within 0.1% of the value reported in 2012. By visualizing the first and second layers of AlexNet, they observed two specific issues:
- Filters at layer 1 are a mix of extremely high and low frequency information, with little coverage of the mid frequencies. Without the mid frequencies, there is a chain effect: deeper features can only learn from extremely high and low frequency information.
Note: Spatial frequency describes the periodic distribution of light and dark in an image. High spatial frequencies correspond to features such as sharp edges and fine details, whereas low spatial frequencies correspond to features such as global shape.
- Layer 2 shows aliasing artifacts caused by the large stride of 4 used in the 1st-layer convolutions. Aliasing occurs when the sampling frequency is too low.
Note: In each CNN layer (unless using upsampling or a DeconvNet) we are mainly downsampling (discretizing) the image. If the sampling frequency is too low (insufficient sampling), we get aliasing effects in the sampled image, such as jagged boundaries/edges and repetitive textures.
To remedy these problems, the authors made the following changes to the AlexNet architecture:
- Reduced the 1st-layer filter size from 11x11 to 7x7.
- Made the stride of the convolution 2, rather than 4; a stride of 2 proved to retain much more pixel information.
This new architecture retains much more information in the 1st- and 2nd-layer features. A minimal sketch of the final ZFNet architecture is given below.
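This is a hedged PyTorch sketch of the layer configuration described in the paper (7x7/2 conv, then 5x5/2, then three 3x3 convs, topped by fully connected layers); the padding values are assumptions chosen to reproduce the paper's reported feature-map sizes.

```python
import torch.nn as nn

# Sketch of the ZFNet layer stack; comments give the spatial size after each stage.
zfnet = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=7, stride=2, padding=1), nn.ReLU(),   # 224 -> 110
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),                  # 110 -> 55
    nn.Conv2d(96, 256, kernel_size=5, stride=2), nn.ReLU(),            # 55 -> 26
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),                  # 26 -> 13
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),          # 13 -> 13
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),          # 13 -> 13
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),          # 13 -> 13
    nn.MaxPool2d(kernel_size=3, stride=2),                             # 13 -> 6
    nn.Flatten(),
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(4096, 1000),   # softmax is applied in the loss
)
```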
Training
During training, visualization of the first-layer filters revealed that a few of them dominated. To combat this, the authors renormalized each filter in the convolutional layers whose RMS value exceeded a fixed radius of 1e-1 back to that radius.
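A minimal sketch of that renormalization, assuming a per-filter RMS computed over each filter's weights:

```python
import torch

RADIUS = 1e-1  # fixed RMS radius from the paper

def renormalize_(weight: torch.Tensor) -> None:
    """Scale any filter whose RMS exceeds RADIUS back down to RADIUS.

    weight: (out_channels, in_channels, kH, kW); call under torch.no_grad()
    after each parameter update.
    """
    rms = weight.pow(2).mean(dim=(1, 2, 3), keepdim=True).sqrt()
    scale = torch.where(rms > RADIUS, RADIUS / rms, torch.ones_like(rms))
    weight.mul_(scale)
```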
The model was trained on the ImageNet 2012 training set (1.3 million images spread over 1000 classes) on a single NVIDIA GTX 580 GPU with 3 GB of memory.
- **Preprocessing:** Same as AlexNet.
- **Image Augmentation:** Same as AlexNet (224x224 crops here).
- **Dropout:** Same as AlexNet.
- **Kernel Initializer:** 1e-2 for each layer.
- **Bias Initializer:** 0 for each layer.
- **Batch Size:** Same as AlexNet.
- **Optimizer:** Same as AlexNet.
- **L2 Weight Decay:** None.
- **Learning Rate Manager:** Same as AlexNet.
- **Total Epochs:** 70.
- **Total Time:** 12 days.
Results
A single ZFNet model achieves top-1 and top-5 test errors of 38.4% and 16.5% respectively, lower by a margin of 1.7% (top-5) than AlexNet. Their final submission was an ensemble of 6 CNNs (the average of 5 ZFNets plus a network identical to ZFNet except that Conv3, Conv4, and Conv5 have 512, 1024, and 512 channels respectively), which gave an error rate of 14.8%.
Depth of the model is important for obtaining good performance:
Removing the two fully connected layers yielded only a slight increase in error, even though they contain the majority of the model's parameters. Removing two of the middle convolutional layers also made a relatively small difference to the error rate. However, removing both the middle convolutional layers and the fully connected layers yielded a model with only 4 layers whose performance was dramatically worse.
Transfer Learning:
Finally, the authors showed that a model trained on ImageNet generalizes well to other datasets. For this, they kept layers 1-7 of the ImageNet-trained model fixed and trained a new softmax classifier on top (with the appropriate number of classes) using the training images of the new dataset.
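A minimal sketch of this recipe, using torchvision's pretrained AlexNet as a stand-in (torchvision ships no ZFNet) and a hypothetical class count:

```python
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 102  # hypothetical; set to the new dataset's class count

# Load an ImageNet-pretrained model and freeze all existing layers.
model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer with a fresh softmax classifier for the new dataset;
# only this layer's weights get trained.
model.classifier[6] = nn.Linear(4096, NUM_CLASSES)
```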
Visualizations
Feature Visualization
The projections from each layer show the hierarchical nature of the features in the network. Layer 2 responds to corners and other edge/color conjunctions.
Layer 3 has more complex invariance, capturing similar textures such as mesh patterns and text patterns.
Layer 4 shows significant variation, but is more class-specific such as dog faces and bird’s legs.
Layer 5 shows entire objects with significant pose variation such as keyboards and dogs.
Feature Evolution during Training
Here, the lower layers of the model can be seen to converge within a few epochs. The upper layers, however, only develop after a considerable number of epochs (40-50), demonstrating the need to let models train until fully converged.
Feature Invariance
Small transformations have a dramatic effect in the first layer of the model, but a lesser impact at the top feature layer, where the effect is quasi-linear for translation and scaling. The network output is stable to translations and scaling. In general, the output is not invariant to rotation, except for objects with rotational symmetry (e.g. an entertainment center).
Occlusion Sensitivity
With these image classification approaches, a natural question arises: is the model truly identifying the location of the object in the image, or just using the surrounding context?
The authors attempt to answer this question by systematically occluding different portions of the input image with a gray square and monitoring the output of the classifier. The paper's examples show visualizations from the strongest feature map of the top convolutional layer, together with the activity in this map (summed over spatial locations) as a function of occluder position. They clearly show that the model is localizing the objects within the scene, as the probability of the correct class and the activity in the feature map drop significantly when the object is occluded. So the model, while trained for classification, is highly sensitive to local structure in the image and is not just using broad scene context.
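A sketch of that occlusion sweep (function and parameter names are illustrative):

```python
import torch

def occlusion_map(model, image, target_class, patch=32, stride=16, gray=0.5):
    """Slide a gray square over the image and record the correct-class
    probability at each position; low values mark regions the model relies on."""
    _, H, W = image.shape  # image: (3, H, W) tensor
    heat = []
    model.eval()
    with torch.no_grad():
        for top in range(0, H - patch + 1, stride):
            row = []
            for left in range(0, W - patch + 1, stride):
                occluded = image.clone()
                occluded[:, top:top + patch, left:left + patch] = gray
                probs = model(occluded.unsqueeze(0)).softmax(dim=1)
                row.append(probs[0, target_class].item())
            heat.append(row)
    return torch.tensor(heat)
```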
Remarks
Thus, the paper holds its significance for introducing the perspective we need when structuring a CNN architecture. The visualization techniques introduced here remain relevant for diagnosing model performance and for choosing data preprocessing techniques that yield better results. The authors brought to light the fact that CNN models do not generate features with random, non-interpretable patterns (a black box, as many thought), but instead reveal several intuitively desirable properties, such as compositionality, increasing invariance, and class discrimination, as we ascend the layers of a CNN model.