Each year's ILSVRC winners conveyed some interesting insights, and 2014 was special in that regard. For most years the challenge tasks were:
- Image classification: Predict the classes of objects present in an image.
- Single-object localization: Image classification + draw a bounding box around one example of each object present.
- Object detection: Image classification + draw a bounding box around each object present.
By 2014, as more and more fresh architectures were unveiled, it was apparent that no single CNN architecture could champion all the tasks, and the 2014 winners were a perfect embodiment of that.
VGGNet was introduced in the paper titled Very Deep Convolutional Networks for Large-Scale Image Recognition by Karen Simonyan and Andrew Zisserman. The VGGNet architecture won the localization task while bagging 2nd place in the classification task. The beauty of this network lies in its architectural simplicity and in reinforcing the idea that deeper CNNs yield improved performance.
Improvements over top CNN Architectures
Since 2012, there had been numerous attempts to improve on AlexNet in every possible way. In 2013, both OverFeat and ZFNet improved performance over AlexNet by utilizing a smaller receptive window size (7 × 7) and a smaller stride (2) in their first convolutional layer. Small-size convolution filters had previously been used by Dan Ciresan Net, but those nets were significantly less deep and were not evaluated on a large-scale dataset. In VGGNet, the authors used petite 3 × 3 receptive fields (the smallest size that can capture the notion of left/right, up/down, and center) throughout the 16- and 19-layer-deep networks, with a stride of 1 and padding of 1, so that spatial resolution is preserved after each convolution. Spatial pooling was carried out by max-pooling layers (following some, but not all, of the convolutional layers) over a 2 × 2 window with a stride of 2, instead of the 3 × 3 window used in AlexNet. The reasons the authors provide for this design are:
- a stack of 2 or 3 consecutive 3×3 layers has an effective receptive field of 5×5 or 7×7 respectively,
- since every convolution is followed by a non-linear ReLU activation, stacking several of them makes the decision function more discriminative than a single ReLU,
- a stack of three 3×3 convolutional layers with C input and output channels has 3 × (3²C²) = 27C² parameters, whereas a single 7×7 convolutional layer with the same effective receptive field has 7²C² = 49C² parameters, about 81% more (a quick check follows this list).
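A quick sanity check of the last point, as a hypothetical Python snippet (here C is the shared number of input and output channels, chosen arbitrarily for illustration):

```python
# Parameter count (ignoring biases) of a single 7x7 conv layer versus a stack
# of three 3x3 conv layers, both with C input and C output channels and the
# same 7x7 effective receptive field.
def conv_params(kernel_size, channels, num_layers=1):
    return num_layers * kernel_size ** 2 * channels ** 2

C = 256                              # arbitrary example channel count
single_7x7 = conv_params(7, C)       # 49 * C^2
stack_of_3x3 = conv_params(3, C, 3)  # 27 * C^2
print(single_7x7 / stack_of_3x3)     # ~1.81 -> the 7x7 layer has ~81% more parameters
```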
Like VGGNet, HowardNet and OverFeat also improved performance by utilizing multiple image scales during both training and testing of the network, instead of the single scale used by AlexNet.
Training
Training and evaluation of the roughly 140-million-parameter VGGNet were performed on 4 NVIDIA Titan Black GPUs installed in a single system. Multi-GPU training exploits data parallelism and is carried out by splitting each batch of training images into several GPU batches, processed in parallel on each GPU. After the GPU batch gradients are computed, they are averaged to obtain the gradient of the full batch. Gradient computation is synchronous across the GPUs, so the result is the same as when training on a single GPU.
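As a rough illustration, the same synchronous data-parallel scheme can be sketched in a few lines of PyTorch. This is an assumption for illustration only (the paper's implementation was a modified C++ Caffe toolbox, not PyTorch):

```python
import torch
import torch.nn as nn
import torchvision

# nn.DataParallel splits each batch across the visible GPUs, runs the forward
# and backward passes on each GPU in parallel, and combines the per-GPU
# gradients so the parameter update is equivalent to single-GPU training on
# the full batch.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torchvision.models.vgg16(weights=None).to(device)  # random weights, for illustration
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # e.g. 4 GPUs -> 4 sub-batches of 64 from a batch of 256

criterion = nn.CrossEntropyLoss()
images = torch.randn(8, 3, 224, 224, device=device)          # dummy batch
labels = torch.randint(0, 1000, (8,), device=device)         # dummy ImageNet labels
loss = criterion(model(images), labels)
loss.backward()   # per-GPU gradients are combined here, as if computed on one GPU
```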
Despite the larger number of parameters and the greater depth of VGGNet compared to AlexNet, it required fewer epochs to converge, which the authors conjecture might be due to the implicit regularization imposed by the greater depth and smaller convolutional filter sizes, and to the pre-initialization of certain layers.
VGGNet does not contain the Local Response Normalization used in AlexNet, because such normalization does not improve performance; instead it leads to increased memory consumption and computation time.
Preprocessing: The mean RGB value computed over the training set was subtracted from each pixel.
Image Augmentation:
Single-scale training: The authors first trained the network using images with the smallest side rescaled to S = 256. Then, to speed up training of the network at S = 384, it was initialized with the weights pre-trained at scale 256, and a smaller initial learning rate of 1e-3 was used.
Multi-scale training: Each training image was individually rescaled by randomly sampling its scale S from the range [Smin, Smax], where Smin = 256 and Smax = 512. For speed reasons, the authors trained the multi-scale models by fine-tuning all layers of a single-scale model with the same configuration, pre-trained at a fixed scale of 384.
Random Crop: Finally, to feed the network with fixed-size 224×224 input images, rescaled training images were randomly cropped (one crop per image per SGD iteration). To further augment the training set, the crops underwent random horizontal flipping and random RGB color shift, as done in AlexNet (a rough approximation of this pipeline is sketched below).
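A hedged approximation of the training-time pipeline using torchvision transforms. Assumptions: ColorJitter stands in for the AlexNet-style PCA-based RGB shift, the scale jitter is sampled from a coarse grid rather than a continuous range, and the per-channel means are the commonly used ImageNet values, not the exact statistics from the paper:

```python
import torchvision.transforms as T

S_MIN, S_MAX = 256, 512            # multi-scale training range for the smallest side
MEAN_RGB = [0.485, 0.456, 0.406]   # common ImageNet per-channel means (an assumption)

train_transform = T.Compose([
    T.RandomChoice([T.Resize(s) for s in range(S_MIN, S_MAX + 1, 32)]),  # scale jitter
    T.RandomCrop(224),             # one 224x224 crop per image per SGD iteration
    T.RandomHorizontalFlip(),      # random horizontal flip
    T.ColorJitter(brightness=0.1, contrast=0.1, saturation=0.1),  # stand-in for RGB shift
    T.ToTensor(),
    T.Normalize(mean=MEAN_RGB, std=[1.0, 1.0, 1.0]),  # mean subtraction only, as in the paper
])
```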
Dropout: Same as AlexNet
Kernel Initializer: Same
Bias Initializer: 0 for each layer
Batch Size: 256
Optimizer: Same
L2 weight decay: Same
Learning Rate Manager: Same
Total epochs: 74
Total time: 21 days (max for VGG-19)
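Putting the hyperparameters above together, a minimal PyTorch-style sketch of the schedule. Assumptions: the values follow the VGG paper (SGD with momentum 0.9, weight decay 5e-4, initial learning rate 1e-2 divided by 10 whenever validation accuracy stops improving), ReduceLROnPlateau is used as a stand-in for that learning-rate manager, and `toy_model` is a placeholder for the real network:

```python
import torch
import torch.nn as nn

toy_model = nn.Linear(10, 10)   # placeholder; the real network is VGG-16/19
optimizer = torch.optim.SGD(toy_model.parameters(), lr=1e-2,
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.1)   # divide the LR by 10 on a validation-accuracy plateau

# Inside the training loop, once per epoch after the validation pass:
# scheduler.step(val_accuracy)
```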
Results
Test Time Augmentation: During test time the network was applied densely over the rescaled test images, in a way similar to OverFeat. Namely, the fully connected layers were first converted to convolutional layers (the first FC layer to a 7 × 7 convolutional layer, the last two FC layers to 1 × 1 convolutional layers). The resulting fully convolutional net was then applied to the whole (uncropped) images. The result was a class score map with the number of channels equal to the number of classes, and a variable spatial resolution dependent on the input image size. Finally, to obtain a fixed-size vector of class scores for the image, the class score map was spatially averaged (sum-pooled). The authors also augmented the test set by horizontally flipping the images; the soft-max class posteriors of the original and flipped images were averaged to obtain the final scores for the image.
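A hypothetical PyTorch sketch of this FC-to-convolution conversion, using a randomly initialized torchvision VGG-16 purely for illustration (the original was done in the authors' own Caffe-based code):

```python
import torch
import torch.nn as nn
import torchvision

# The first FC layer (512*7*7 -> 4096) becomes a 7x7 convolution, the other
# two FC layers become 1x1 convolutions, so the network can ingest images of
# any size and emit a spatial class-score map.
vgg = torchvision.models.vgg16(weights=None).eval()
fc1, fc2, fc3 = vgg.classifier[0], vgg.classifier[3], vgg.classifier[6]

conv1 = nn.Conv2d(512, 4096, kernel_size=7)
conv1.weight.data = fc1.weight.data.view(4096, 512, 7, 7)
conv1.bias.data = fc1.bias.data
conv2 = nn.Conv2d(4096, 4096, kernel_size=1)
conv2.weight.data = fc2.weight.data.view(4096, 4096, 1, 1)
conv2.bias.data = fc2.bias.data
conv3 = nn.Conv2d(4096, 1000, kernel_size=1)
conv3.weight.data = fc3.weight.data.view(1000, 4096, 1, 1)
conv3.bias.data = fc3.bias.data

fully_conv = nn.Sequential(vgg.features, conv1, nn.ReLU(inplace=True),
                           conv2, nn.ReLU(inplace=True), conv3)

with torch.no_grad():
    image = torch.randn(1, 3, 384, 384)           # uncropped image larger than 224
    score_map = fully_conv(image)                 # (1, 1000, H', W') class-score map
    class_scores = score_map.mean(dim=(2, 3))     # spatial averaging -> fixed-size vector
```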
The authors justify using dense evaluation instead of the multi-crop evaluation performed in AlexNet by its lower computation time; still, since the two methods are complementary (due to different convolution boundary conditions), they were used together for the best results.
When applying a CNN to a cropped image, the convolved feature maps are padded with zeros, whereas for dense evaluation the padding for the same crop naturally comes from the neighboring parts of the image (due to both the convolutions and spatial pooling); this substantially increases the overall network receptive field, so more context is captured.
Single Scale Evaluation: The test image scale Q was set as follows:
- Q = S for fixed training image scale S, and
- Q = 0.5(Smin + Smax) for jittered S ∈ [Smin, Smax].
The authors observed that the classification error decreased with increased ConvNet depth: from 11 layers in A to 19 layers in E (but saturated after that). Scale jittering at training time (S ∈ [256, 512]) led to significantly better results than training on images with a fixed smallest side (S = 256 or S = 384), even though a single scale was used at test time. This confirmed that training-set augmentation by scale jittering is indeed helpful for capturing multi-scale image statistics.
Even the least-performing network, A, achieved a 10.4% top-5 error, confirming that a deep network with small filters outperforms a shallow network with larger filters.
Multi-Scale Evaluation: Here, the authors assessed the effect of scale jittering at test time. It consisted of running a model over several rescaled versions of a test image (corresponding to different values of Q), followed by averaging the resulting class posteriors. The model trained with variable S ∈ [Smin; Smax] was evaluated over a larger range of sizes Q = {Smin, 0.5(Smin + Smax), Smax}.
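A hedged sketch of this test-time scale jittering, reusing the hypothetical `fully_conv` network from the dense-evaluation snippet above:

```python
import torch
import torch.nn.functional as F

def multi_scale_predict(fully_conv, image, scales=(256, 384, 512)):
    """Average soft-max posteriors over several rescaled copies of one image.

    `fully_conv` is the converted fully convolutional network from the earlier
    sketch; `image` is a (1, 3, H, W) float tensor.
    """
    posteriors = []
    for q in scales:
        h, w = image.shape[-2:]
        factor = q / min(h, w)                       # rescale smallest side to Q
        resized = F.interpolate(image, scale_factor=factor,
                                mode="bilinear", align_corners=False)
        score_map = fully_conv(resized)              # (1, 1000, H', W')
        scores = score_map.mean(dim=(2, 3))          # spatial averaging
        posteriors.append(F.softmax(scores, dim=1))
    return torch.stack(posteriors).mean(dim=0)       # average over scales
```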
The results indicated that scale jittering at test time leads to better performance as compared to evaluating the same model at a single scale.
Multi-Crop and Dense: As mentioned earlier, the best-performing networks D (VGG-16) and E (VGG-19) achieved slightly better results when multi-crop and dense evaluation were used together.
Final submission ensembling VGG16 and VGG19
Similar to the AlexNet (2012) and ZFNet (2013) submissions, the authors too submitted an ensemble (combining the outputs of several models by averaging their soft-max class posteriors) of their best-performing models D and E; just two models, significantly fewer than in earlier submissions, yet remarkably outperforming them. The final top-5 error of 6.8% outperformed all earlier submitted results.
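A minimal sketch of this ensembling step, with random-weight torchvision models standing in for the trained D and E configurations:

```python
import torch
import torch.nn.functional as F
import torchvision

# Average the soft-max class posteriors of the two best models
# (D = VGG-16, E = VGG-19) to obtain the ensemble prediction.
vgg16 = torchvision.models.vgg16(weights=None).eval()
vgg19 = torchvision.models.vgg19(weights=None).eval()

with torch.no_grad():
    batch = torch.randn(4, 3, 224, 224)          # dummy test images
    p16 = F.softmax(vgg16(batch), dim=1)
    p19 = F.softmax(vgg19(batch), dim=1)
    ensemble = (p16 + p19) / 2                   # averaged posteriors
    top5 = ensemble.topk(5, dim=1).indices       # top-5 predicted classes
```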
Remarks
VGGNet, simplicity at its best compared to its competitor GoogLeNet, had a few but important insights to offer. The use of the now-omnipresent 3×3 convolutional layers throughout an architecture was seeded here. Both VGGNet and GoogLeNet, the winners of 2014, used the concept of the effective receptive field and highlighted the importance of depth in visual representations, which eventually became the stepping stone of the breakthrough transformation arriving the next year.