A convolutional neural network (ConvNet/CNN) is a type of artificial neural network used widely in computer vision, particularly for deep learning tasks such as image classification. That's a lot of terms, so let's break them down one by one.
Computer Vision - Computer vision is a branch of artificial intelligence (AI) that allows computers and systems to extract useful information from digital photos, videos, and other visual inputs, and to take actions or make recommendations based on that information.
Deep Learning - Deep learning is a machine learning approach in which neural networks with many layers learn from examples, much as people learn by example.
Classification - Classification is a predictive modelling problem in which a class label is predicted for a given example of input data.
Artificial Neural Network - An artificial neural network is a computing system modelled loosely on the biological neural networks that make up the human brain. Like the brain, it contains neurons that are connected to each other across the various layers of the network. These neurons are called nodes.
CNNs are distinguished from other neural networks by their superior performance with image, speech, or audio signal inputs. They have three main types of layers:
- Convolutional layer
- Pooling layer
- Fully connected layer (FC layer)
The CNN becomes more complex with each layer, and each layer detects larger areas of the picture. Earlier layers concentrate on basic features like colors and edges. As the visual data travels through the CNN layers, the network begins to recognize larger components or features of the object, eventually identifying the target object.
Let us dive deeper into what happens in each layer.
Most of the computation occurs in the convolutional layer. It requires three components: the input data, a filter, and a feature map. The filter is also known as a kernel or feature detector. The filter moves across the receptive fields of the image to check whether a particular feature is present. This process is called convolution.
Let us assume that the input is a color image, which is a 3D matrix of pixels. It has three dimensions (height, width, and depth), where the depth corresponds to the RGB channels. The feature detector is a 2D array of weights that represents part of the image. The filter is typically a 3x3 matrix, and its size determines the size of the receptive field.
The filter is applied to an area of the image, and a dot product is calculated between the input pixels and the filter. This dot product is then written into the output array. The filter then shifts by a stride and repeats the process until it has swept across the entire input image. The final output from the series of dot products between the input and the filter is known as a feature map, activation map, or convolved feature.
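To make the sliding dot product concrete, here is a minimal NumPy sketch of the process described above. It assumes a single-channel (2D) input, no padding, and a hand-picked 3x3 filter; the function name `convolve2d` and the example values are illustrative, not taken from any library. Like most deep learning frameworks, it does not flip the kernel before applying it.

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over a 2D `image` (no padding) and return the feature map."""
    image_h, image_w = image.shape
    kernel_h, kernel_w = kernel.shape
    # Number of positions the filter can occupy along each axis.
    out_h = (image_h - kernel_h) // stride + 1
    out_w = (image_w - kernel_w) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for row in range(out_h):
        for col in range(out_w):
            top, left = row * stride, col * stride
            # The receptive field: the patch of the image under the filter.
            patch = image[top:top + kernel_h, left:left + kernel_w]
            # Dot product between the patch and the filter weights.
            feature_map[row, col] = np.sum(patch * kernel)
    return feature_map

# Hypothetical 5x5 grayscale image whose values increase left to right.
image = np.arange(25, dtype=float).reshape(5, 5)
# Hypothetical 3x3 filter that responds to horizontal changes in intensity.
kernel = np.array([[1.0, 0.0, -1.0]] * 3)

feature_map = convolve2d(image, kernel, stride=1)
print(feature_map.shape)  # (3, 3)
```

With a stride of 2, the same 5x5 input would produce a 2x2 feature map, because the filter visits fewer positions.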
The pooling layer, also known as downsampling, performs dimensionality reduction, reducing the number of parameters in the input. This decreases the computational power required to process the data. It also helps extract dominant features, which keeps training effective.
Similar to the convolutional layer, the pooling operation sweeps a filter across the entire input, but the difference is that this filter has no weights. Instead, the kernel applies an aggregation function to the values within the receptive field, populating the output array.
There are 2 types of pooling:
a. Max Pooling: The filter returns the maximum value from the portion of the image covered by the kernel. This type is used more often.
b. Average Pooling: The filter returns the average of all the values from the portion of the image covered by the kernel.
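The two pooling types can be compared side by side with a short NumPy sketch. It assumes a 2x2 window with a stride of 2, a common choice but not the only one; the helper `pool2d` and the sample feature map are made up for illustration.

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Downsample a 2D feature map with max or average pooling."""
    if mode == "max":
        aggregate = np.max
    elif mode == "average":
        aggregate = np.mean
    else:
        raise ValueError(f"unknown pooling mode: {mode}")
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    pooled = np.zeros((out_h, out_w))
    for row in range(out_h):
        for col in range(out_w):
            top, left = row * stride, col * stride
            # The window has no weights; it only aggregates the values it covers.
            window = feature_map[top:top + size, left:left + size]
            pooled[row, col] = aggregate(window)
    return pooled

# Hypothetical 4x4 feature map.
feature_map = np.array([
    [1.0, 3.0, 2.0, 4.0],
    [5.0, 6.0, 1.0, 2.0],
    [7.0, 2.0, 9.0, 1.0],
    [3.0, 4.0, 5.0, 6.0],
])

max_pooled = pool2d(feature_map, mode="max")      # [[6, 4], [7, 9]]
avg_pooled = pool2d(feature_map, mode="average")  # [[3.75, 2.25], [4.0, 5.25]]
```

Both outputs are 2x2, a quarter of the original size, which is where the reduction in computation comes from.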
After the steps above, the model has learned to extract the features. Next, we flatten the final output and feed it to a standard neural network for classification.
In partially connected layers, the pixel values of the input image are not directly connected to the output layer. In the fully connected layer, however, each node in the output layer connects directly to every node in the previous layer. This layer performs classification based on the features extracted by the previous layers and their various filters.
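As a rough illustration of flattening followed by a fully connected layer, the sketch below uses NumPy with randomly initialised weights. The shapes (four 2x2 pooled maps, three classes) and all variable names are assumptions chosen for the example; in a real network the weights would be learned during training.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical output of the last pooling layer: 2x2 maps from 4 filters.
pooled_maps = rng.random((2, 2, 4))

# Flatten the 3D output into a single vector of 2 * 2 * 4 = 16 features.
flattened = pooled_maps.reshape(-1)

# Fully connected layer: every output node has one weight per input feature.
num_classes = 3
weights = rng.normal(size=(num_classes, flattened.size))
biases = np.zeros(num_classes)

# One raw score per class, each computed from all 16 features.
scores = weights @ flattened + biases
print(flattened.shape, scores.shape)  # (16,) (3,)
```

The weight matrix has one row per output node and one column per input feature, which is what "fully connected" means in practice.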
In the convolutional and pooling stages, we usually use ReLU functions (typically applied to the output of each convolution), while the fully connected layer usually uses a softmax activation function to classify inputs properly. Softmax returns a probability between 0 and 1 for each class, and the probabilities sum to 1.
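The two activation functions mentioned here are short enough to write out directly. This is a minimal NumPy sketch with made-up input values; subtracting the maximum score inside `softmax` is a standard numerical-stability step and does not change the result.

```python
import numpy as np

def relu(values):
    """Replace negative values with zero and keep positive values unchanged."""
    return np.maximum(values, 0.0)

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # avoids overflow in np.exp
    exps = np.exp(shifted)
    return exps / np.sum(exps)

# Hypothetical values from a feature map and from a fully connected layer.
feature_values = np.array([-2.0, 0.5, 3.0])
class_scores = np.array([2.0, 1.0, 0.1])

activated = relu(feature_values)       # [0.0, 0.5, 3.0]
probabilities = softmax(class_scores)  # largest score gets the largest probability
predicted_class = int(np.argmax(probabilities))
```

The class with the highest probability is taken as the prediction, which here is class 0.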
Hope this gave you a brief introduction to what Convolutional Neural Networks are. :)