
Rohit Gupta
Vision Transformer: An Image is Worth 16×16 Words

In computer vision, convolutional architectures remain dominant. Inspired by NLP successes, multiple works combine CNN-like architectures with self-attention, and some replace the convolutions entirely. The latter models, while theoretically efficient, have not yet been scaled effectively on modern hardware accelerators because they use specialized attention patterns.
Inspired by the Transformer's scaling successes in NLP, the paper An Image is Worth 16×16 Words applies a standard Transformer directly to images, with the fewest possible modifications.

When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), the Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.

In particular, the best model reaches 88.55% accuracy on ImageNet, 90.72% on ImageNet-ReaL, 94.55% on CIFAR-100, and 77.63% on the VTAB suite of 19 tasks.

Problem with CNNs:
CNNs use kernels to aggregate very local information in each layer. The result is passed to the next layer, which again uses kernels to aggregate local information. Hence a CNN starts by looking very locally. The Vision Transformer resolves this problem.

How do Transformers resolve it?
A Transformer considers a very large field of view from the very beginning, overcoming the narrow initial view of CNNs. Also, there is no decoder; instead, an extra linear layer called the MLP head performs the final classification.
The Transformer looks at the data by taking an input image and splitting it into patches of 16×16 pixels.
(Figure: input image split into 16×16 patches)
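The patching step can be sketched in a few lines of NumPy. The 224×224 input resolution below is illustrative (it is a common choice, giving 14×14 = 196 patches); the random image stands in for real data:

```python
import numpy as np

# Toy example: split a 224x224 RGB image into 16x16 patches.
# 224 / 16 = 14, so we get 14 * 14 = 196 patches, each flattened
# to 16 * 16 * 3 = 768 values.
image = np.random.rand(224, 224, 3)
P = 16  # patch size

h, w, c = image.shape
patches = image.reshape(h // P, P, w // P, P, c)  # (14, 16, 14, 16, 3)
patches = patches.transpose(0, 2, 1, 3, 4)        # (14, 14, 16, 16, 3)
patches = patches.reshape(-1, P * P * c)          # (196, 768)

print(patches.shape)  # (196, 768)
```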

All the patches are treated as simple tokens: they are flattened and projected into lower-dimensional linear embeddings.
Positional embeddings are then added to these vectors, and the resulting sequence is fed into a standard Transformer encoder.
The model is pre-trained with image labels on a huge, fully supervised dataset.
Finally, the network is fine-tuned on the downstream dataset.
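Putting the input pipeline together (linear projection, prepended [class] token, positional embeddings), a minimal NumPy sketch looks like this. The embedding size D = 64 is shrunk for readability (ViT-Base uses 768), and random values stand in for learned weights:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: 196 flattened 16x16x3 patches, projected to D dims.
num_patches, patch_dim, D = 196, 768, 64

patches = rng.standard_normal((num_patches, patch_dim))  # flattened patches
W = rng.standard_normal((patch_dim, D)) * 0.02           # linear projection
tokens = patches @ W                                     # (196, 64)

# Prepend a learnable [class] token, then add positional embeddings.
cls_token = rng.standard_normal((1, D))
pos_embed = rng.standard_normal((num_patches + 1, D)) * 0.02

encoder_input = np.concatenate([cls_token, tokens], axis=0) + pos_embed
print(encoder_input.shape)  # (197, 64) -- what the encoder consumes
```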

When solving an NLP problem, inputs (like incomplete sentences) are first converted into numeric indices (by creating a vocabulary dictionary from the words present in the training data) and then fed into the Transformer.
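A toy illustration of that NLP-side conversion (the two sentences are made up):

```python
# Build a vocabulary from the training data, then map words to indices.
sentences = ["the cat sat", "the dog sat down"]
vocab = {word: idx for idx, word in enumerate(
    sorted({w for s in sentences for w in s.split()}))}

indices = [[vocab[w] for w in s.split()] for s in sentences]
print(vocab)    # {'cat': 0, 'dog': 1, 'down': 2, 'sat': 3, 'the': 4}
print(indices)  # [[4, 0, 3], [4, 1, 3, 2]]
```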

Positional Embedding: A positional encoding is a re-representation of a word's value together with its position in the sentence (being at the beginning is not the same as being at the end or in the middle).
But sentences can be of any length, so saying "word X is the third in the sentence" is not enough on its own: third in a 3-word sentence is completely different from third in a 20-word sentence.
A positional encoder uses the cyclic nature of the sin(x) and cos(x) functions to encode the position of a word in a sentence.
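The sin/cos idea can be sketched as the classic fixed encoding from the original Transformer paper. Note that ViT itself actually uses learned 1-D positional embeddings; this sinusoidal variant only illustrates how sin and cos encode position:

```python
import numpy as np

def sinusoidal_positions(seq_len: int, d_model: int) -> np.ndarray:
    """Fixed sin/cos positional encoding ('Attention Is All You Need').

    Even dimensions get sin, odd dimensions get cos, with wavelengths
    forming a geometric progression up to 10000."""
    positions = np.arange(seq_len)[:, None]              # (seq_len, 1)
    div = 10000 ** (np.arange(0, d_model, 2) / d_model)  # (d_model / 2,)
    enc = np.zeros((seq_len, d_model))
    enc[:, 0::2] = np.sin(positions / div)
    enc[:, 1::2] = np.cos(positions / div)
    return enc

# 197 positions (196 patches + [class] token), toy embedding size 64.
pe = sinusoidal_positions(seq_len=197, d_model=64)
print(pe.shape)  # (197, 64)
```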

In all, three variants of ViT were proposed:

| Model | Layers | Hidden size D | MLP size | Heads | Params |
| --- | --- | --- | --- | --- | --- |
| ViT-Base | 12 | 768 | 3072 | 12 | 86M |
| ViT-Large | 24 | 1024 | 4096 | 16 | 307M |
| ViT-Huge | 32 | 1280 | 5120 | 16 | 632M |

The hidden size D is the embedding size, which is kept fixed throughout the layers.
Transformers are attractive in general because they scale up well with more data and compute.


Problems which still need to be resolved
Transformers are unfocused in the initial epochs and only learn where to attend after some training, which makes them more data-hungry than CNNs.
Transformers find very original and unexpected ways to look at the input images, since nothing in the architecture tells the model exactly how to do so. CNNs, by contrast, are focused on a local view from the beginning by their convolutions.

Transformers lack the inductive biases of Convolutional Neural Networks (CNNs), such as translation invariance and a locally restricted receptive field. Invariance means that you can recognize an entity (i.e., an object) in an image even when its appearance or position varies. Translation, in computer vision, means that each image pixel is moved by a fixed amount in a particular direction.
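Translation is easy to picture with a tiny example. Here a 4×4 "image" is shifted one pixel to the right (with wrap-around, via `np.roll`, purely for illustration):

```python
import numpy as np

# Every pixel moves by a fixed amount in a fixed direction: a translation.
image = np.arange(16).reshape(4, 4)
shifted = np.roll(image, shift=1, axis=1)  # one pixel to the right

print(image[0])    # [0 1 2 3]
print(shifted[0])  # [3 0 1 2]
```

A CNN's convolutions respond to the shifted image with correspondingly shifted feature maps, which is exactly the bias a plain Transformer has to learn from data.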

The key takeaway of this work is the formulation of image classification as a sequence problem, using image patches as tokens and processing them with a Transformer. That sounds simple, but it needs massive data and very high computational power: only when trained on datasets with more than 14M images can ViT approach or beat state-of-the-art CNNs.

Further Reading: Official Paper

That's all folks.

If you have any doubt ask me in the comments section and I'll try to answer as soon as possible.

Have an awesome day ahead 😀!
