Before diving into CNNs, let's get the background clear.
Deep learning is a type of machine learning and artificial intelligence (AI) that imitates the way humans gain certain types of knowledge. Deep learning is an important part of data science, which includes statistics and predictive modeling. It is very helpful to data scientists who are tasked with collecting, analyzing and interpreting large amounts of data; deep learning makes this process easier and faster.
As the volume of data grows, traditional machine learning techniques, no matter how well tuned, often start to level off in terms of performance and accuracy, whereas deep learning tends to perform much better in such cases.
The neural network is the heart of deep learning models, and it was originally designed to mimic the working of the neurons in the human brain. A neural network is composed of layers of nodes: an input layer, one or more hidden layers, and an output layer. Each node connects to others and has an associated weight and threshold. If the output of an individual node is above its threshold value, that node is activated and sends data to the next layer of the network. Otherwise, no data is passed along to the next layer.
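To make the threshold idea concrete, here is a minimal sketch in Python with NumPy. The weights, thresholds and layer sizes are made-up values chosen for illustration, and the helper name `step_layer` is hypothetical. A node that does not fire outputs 0, which stands in for "no data is passed along".

```python
import numpy as np

def step_layer(inputs, weights, thresholds):
    """Each node outputs 1.0 only if its weighted sum exceeds its threshold."""
    weighted_sums = weights @ inputs
    return (weighted_sums > thresholds).astype(float)

# Illustrative values only (not from any trained model)
x = np.array([1.0, 0.0, 1.0])             # input layer with 3 values
W_hidden = np.array([[0.5, 0.2, 0.4],     # hidden layer with 2 nodes
                     [0.1, 0.9, 0.1]])
t_hidden = np.array([0.6, 0.5])           # one threshold per hidden node
W_out = np.array([[1.0, 1.0]])            # output layer with 1 node
t_out = np.array([0.5])

hidden = step_layer(x, W_hidden, t_hidden)   # [1., 0.]: only the first node fires
output = step_layer(hidden, W_out, t_out)    # [1.]
print(hidden, output)
```

Modern networks usually replace the hard threshold with a smooth activation function so that the weights can be learned by gradient descent.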
Now, let's understand the Convolutional Neural Network.
Convolutional neural networks (CNNs) are one of the main types of neural networks. CNNs are used for image recognition and classification in order to detect objects, recognize faces, etc. They are made up of neurons with learnable weights and biases. Each neuron receives several inputs, takes a weighted sum over them, passes the result through an activation function, and responds with an output.
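A single neuron of this kind can be sketched in a few lines of Python. The sigmoid activation and all numeric values here are assumptions picked for illustration; in a real CNN the weights and bias are learned during training.

```python
import math

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through a sigmoid activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative inputs, weights and bias
out = neuron([0.5, -1.0, 2.0], [0.4, 0.3, 0.1], bias=0.1)
print(out)  # a value between 0 and 1
```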
CNNs are primarily used to classify images, cluster them by similarities, and then perform object recognition. Many algorithms using CNNs can identify faces, street signs, animals, etc.
Before we go into the working of CNNs, let's cover the basics, such as what an image is and how it is represented. An RGB image is nothing but a matrix of pixel values with three planes (channels), whereas a grayscale image is the same but has a single plane.
Read the required input data, scale it to pixel values between 0–255, and specify the number of channels: one for grayscale or three for RGB.
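As a rough illustration of this representation, the sketch below builds a grayscale and an RGB array with NumPy and scales arbitrary numeric data into the 0–255 range. The helper `to_pixel_range` and the random input data are assumptions made for the example; a real pipeline would read the pixels from an image file.

```python
import numpy as np

def to_pixel_range(data):
    """Min-max scale arbitrary numeric data into the 0-255 pixel range."""
    data = np.asarray(data, dtype=np.float64)
    lo, hi = data.min(), data.max()
    if hi == lo:  # constant input: avoid division by zero
        return np.zeros(data.shape, dtype=np.uint8)
    scaled = (data - lo) / (hi - lo) * 255.0
    return np.round(scaled).astype(np.uint8)

rng = np.random.default_rng(0)
raw_gray = rng.normal(size=(4, 4))      # stand-in for a 4 x 4 single-plane image
raw_rgb = rng.normal(size=(4, 4, 3))    # stand-in for a 4 x 4 image with three planes

gray = to_pixel_range(raw_gray)
rgb = to_pixel_range(raw_rgb)
print(gray.shape, rgb.shape)  # (4, 4) (4, 4, 3)
```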
The first layer in a CNN is the CONVOLUTIONAL LAYER, which is the core building block and does most of the computational heavy lifting. The data or image is convolved using filters, also called kernels. Filters are small units that we apply across the data through a sliding window. The depth of the filter is the same as the depth of the input: for a color (RGB) image the depth is 3, so a filter of depth 3 is applied to it. The process involves taking the element-wise product of the filter and the image region under it, and then summing those values for every position of the sliding window. The output of convolving a color image with a 3D filter is therefore a 2D matrix. For example, imagine a flashlight that shines its light over a 5 x 5 area. Now imagine this flashlight sliding across all the areas of the input image. The flashlight is called a filter (sometimes also referred to as a neuron or a kernel), and the region it is shining over is called the receptive field. The filter is itself an array of numbers (the numbers are called weights or parameters).
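The sliding-window operation can be sketched with plain NumPy as follows. This is a minimal, unoptimized version that assumes a square filter and no padding; the function name `convolve` and the all-ones image and filter are illustrative. Like most deep learning libraries, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
import numpy as np

def convolve(image, kernel, stride=1):
    """Slide a k x k x C filter over an H x W x C image and return a 2D feature map."""
    H, W, C = image.shape
    k, _, kernel_depth = kernel.shape
    assert kernel_depth == C, "filter depth must match image depth"
    out_h = (H - k) // stride + 1
    out_w = (W - k) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            patch = image[r:r + k, c:c + k, :]    # the receptive field
            out[i, j] = np.sum(patch * kernel)    # element-wise product, then sum
    return out

image = np.ones((7, 7, 3))     # toy RGB image, depth 3
kernel = np.ones((5, 5, 3))    # 5 x 5 "flashlight" with matching depth 3
fmap = convolve(image, kernel)
print(fmap.shape)  # (3, 3): a 2D matrix, as described above
```

Each output value here is 75, since the filter covers 5 x 5 x 3 = 75 ones at every position.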
Pooling is done to reduce the dimensionality of the feature maps by downsampling them. It is applied independently to every depth slice of the 3D volume. Typically there are two hyperparameters within this layer:
Spatial extent: the value n such that each n x n region of the feature representation is mapped to a single value
Stride: how many features the sliding window skips along the width and height
There are two main types of pooling:
Max pooling: As the filter moves across the input, it selects the pixel with the maximum value to send to the output array. As an aside, this approach tends to be used more often than average pooling.
Average pooling: As the filter moves across the input, it calculates the average value within the receptive field to send to the output array.
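The two pooling types can be compared on a small example. The sketch below assumes a single 2D feature map and a 2 x 2 window with stride 2; the function name `pool2d` and the input values are made up for illustration.

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    """Downsample a 2D feature map with max or average pooling."""
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    reduce_fn = np.max if mode == "max" else np.mean
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = reduce_fn(x[r:r + size, c:c + size])
    return out

x = np.array([[1, 3, 2, 4],
              [5, 7, 6, 8],
              [9, 2, 1, 0],
              [3, 4, 5, 6]], dtype=float)

max_pooled = pool2d(x, mode="max")   # [[7, 8], [9, 6]]
avg_pooled = pool2d(x, mode="avg")   # [[4, 5], [4.5, 3]]
print(max_pooled)
print(avg_pooled)
```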
A common POOLING LAYER uses a 2 x 2 max filter with a stride of 2, which is a non-overlapping filter. A max filter returns the maximum value among the features within the region. For example, given a 26 x 26 x 32 volume, a max pool layer with 2 x 2 filters and a stride of 2 reduces it to a 13 x 13 x 32 feature map.
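The size of the result follows from the usual output-size formula, (input size - filter size) / stride + 1, which here gives (26 - 2) / 2 + 1 = 13 along both width and height, while the depth of 32 is unchanged. A small sketch with random data confirms the shape; the function name `max_pool_volume` is illustrative.

```python
import numpy as np

def max_pool_volume(volume, size=2, stride=2):
    """Apply max pooling to every depth slice of an H x W x D volume."""
    H, W, D = volume.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.empty((out_h, out_w, D))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            window = volume[r:r + size, c:c + size, :]
            out[i, j, :] = window.max(axis=(0, 1))   # max per depth slice
    return out

volume = np.random.default_rng(0).random((26, 26, 32))   # random stand-in data
pooled = max_pool_volume(volume)
print(pooled.shape)  # (13, 13, 32)
```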
Fully Connected layers in a neural network are layers where every input from one layer is connected to every activation unit of the next layer. This involves flattening the entire pooled feature map into a single column, which is then fed to the neural network for processing. With the fully connected layers, we combine these features to create a model. Finally, we have an activation function such as softmax or sigmoid to classify the output.
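Flattening followed by a fully connected layer and softmax might look like the sketch below. The number of classes (10), the random weights and the random feature map are all assumptions for illustration; in practice the weights are learned during training.

```python
import numpy as np

def softmax(z):
    """Turn raw scores into probabilities that sum to 1."""
    z = z - np.max(z)          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
pooled = rng.random((13, 13, 32))      # stand-in for a pooled feature map
flat = pooled.reshape(-1)              # single column of 13 * 13 * 32 = 5408 values

num_classes = 10                       # hypothetical number of output classes
W = rng.normal(0.0, 0.01, size=(num_classes, flat.size))   # untrained weights
b = np.zeros(num_classes)

probs = softmax(W @ flat + b)          # every input connects to every output unit
print(flat.shape, probs.shape)  # (5408,) (10,)
```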
The layer takes its input and returns the output using an appropriate activation function.
- Provide the input image to the convolution layer
- Choose parameters and apply filters with strides, adding padding if required. Perform convolution on the image and apply ReLU activation to the resulting matrix.
- Perform pooling to reduce the dimensionality
- Add as many convolutional layers as needed
- Flatten the output and feed it into a fully connected layer (FC Layer)
- Output the class using an activation function (logistic regression with a cost function) to classify the image.
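The steps above can be strung together in a small NumPy sketch of a single forward pass. Everything here is an assumption for illustration: the 28 x 28 input, the eight 3 x 3 filters, the five classes, and the random (untrained) weights, so the predicted class is meaningless until the network is trained.

```python
import numpy as np

def conv_layer(image, filters, stride=1, pad=0):
    """image: H x W x C, filters: N x k x k x C -> feature maps of shape out_h x out_w x N."""
    if pad:
        image = np.pad(image, ((pad, pad), (pad, pad), (0, 0)))  # zero padding
    H, W, _ = image.shape
    n, k = filters.shape[0], filters.shape[1]
    out_h = (H - k) // stride + 1
    out_w = (W - k) // stride + 1
    out = np.zeros((out_h, out_w, n))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            patch = image[r:r + k, c:c + k, :]
            out[i, j, :] = np.sum(filters * patch, axis=(1, 2, 3))
    return out

def relu(x):
    return np.maximum(x, 0)

def max_pool(volume, size=2, stride=2):
    H, W, D = volume.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.empty((out_h, out_w, D))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j, :] = volume[r:r + size, c:c + size, :].max(axis=(0, 1))
    return out

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(42)
image = rng.random((28, 28, 3))                      # stand-in RGB input
filters = rng.normal(0.0, 0.1, size=(8, 3, 3, 3))    # eight untrained 3 x 3 x 3 filters

features = max_pool(relu(conv_layer(image, filters)))   # 28 -> 26 -> 13
flat = features.reshape(-1)                              # flatten for the FC layer
W_fc = rng.normal(0.0, 0.01, size=(5, flat.size))        # five hypothetical classes
b_fc = np.zeros(5)

probs = softmax(W_fc @ flat + b_fc)
predicted_class = int(np.argmax(probs))
print(features.shape, predicted_class)
```

In practice a deep learning framework would handle these layers, along with the training loop that learns the filter and fully connected weights.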
Cover : pixabay.com
Neural Network : giphy.com