Written by Raji Ayinla as a contribution to techsmosh
When I say “the year 2250,” what thought pops into your head? It’s probably a world full of artificially intelligent machines re-enacting every possible scenario that science fiction has presented to us. Students in the 23rd century might study the history of artificial intelligence in a computer science course, scoffing at our nascent machine learning algorithms just as we scoff at the ENIAC. But just as the ENIAC--and many other Turing-complete machines--laid the foundation for supercomputers, neural networks are laying the groundwork for androids that can mimic Picasso’s artistic flair.
It’s common practice to conceptualize an artificial neural network (ANN) as a network of biological neurons. To borrow a phrase from Ada Lovelace, an artificial neural network is a “calculus of the nervous system.” A really, really simplified version of the nervous system. So much so that a few neuroscientists have probably lost sleep over the analogy.
A good way to visualize a neural network is not to think of biological neurons. Instead, imagine a car wash. Yes, a car wash. When you take your grime-stained car to your local car wash, the car is your input. Now, imagine that the washing process is hidden from you. When the car wash outputs your car, it’s suddenly clean.
So, we can surmise that you input your car (x), it was forwarded to a hidden layer (h), and then it was forwarded to an output layer (y). This is a gross oversimplification, but what this analogy does depict is forward propagation. Of course, there is more going on in a neural network than the forwarding of data.
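Neural networks consist of an input layer, one or more hidden layers, and an output layer. Note that layers are counted from the first hidden layer, so the input layer is not included in the count. The more hidden layers you have, the more complex the network is. By the rule of thumb used here, four or more hidden layers constitute a deep network, though the exact cutoff varies from author to author. The neural network depicted in the above image is considered to be a 6X4X3X1 network: six inputs, hidden layers of four and three neurons, and one output.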
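As a rough sketch of what the 6X4X3X1 label implies, the snippet below tallies the knobs in such a network. It assumes the layers are fully connected and that every non-input neuron carries one bias; neither detail is stated above, and the variable names are purely illustrative.

```python
# Layer sizes read off the 6X4X3X1 label: input, two hidden layers, output.
layer_sizes = [6, 4, 3, 1]

# Assumption: every neuron connects to every neuron in the next layer.
weights_per_layer = [a * b for a, b in zip(layer_sizes, layer_sizes[1:])]

# Assumption: one bias per neuron outside the input layer.
biases_per_layer = layer_sizes[1:]

hidden_layers = len(layer_sizes) - 2
total_parameters = sum(weights_per_layer) + sum(biases_per_layer)

print(hidden_layers)      # 2
print(weights_per_layer)  # [24, 12, 3]
print(total_parameters)   # 47
```

Even this small network has dozens of values to tune, which hints at why the weights are not set by hand.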
Let’s take a look at the common components shared by all layers: synapses and neurons.
Synapses take the value stored in the previous layer and multiply it by a weight. In this simplified picture, the value of each weight ranges between -1 and 1. As you’ll find out, deep learning is all about calibrating the weights to the values that yield an accurate output. Think of a shower where you adjust the hot and cold knobs to get your ideal temperature. Afterwards, the synapses forward propagate their results to the neurons.
Neurons, on the other hand, have a larger responsibility. They have to add up all of the (Weights * X values) and include a bias.
Here’s what the barebones equation looks like: Y = Σ(weight_i * x_i) + bias
So what does the bias do? Well, picture a scale. The bias sits on the left platform of the scale, which is marked with a 0. The summation sits on the right platform, which is marked with a 1. In this picture the bias acts as a counterweight that the summation has to outweigh: the heavier it is, the harder it is for that neuron to output a 1, and the lighter it is, the easier. When the bias is added to the summation, as it is here, a heavier counterweight corresponds to a more negative bias.
Implementing a bias allows you to offset an input that has a 0 value. Since weights are multiplied by the input along synapses and anything times 0 equals 0, a neuron whose inputs are all 0 would be stuck with a summation of 0 no matter how the weights are tuned, and the final output could be bizarrely different from what you expected. A great argument for the need for a bias can be found in this Stack Overflow thread.
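The summation and the role of the bias can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name weighted_sum and the sample weights are made up for the example.

```python
def weighted_sum(inputs, weights, bias=0.0):
    """Add up every (weight * input) product, then include the bias."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias


# Illustrative values only; all inputs are 0.
zero_inputs = [0.0, 0.0, 0.0]
weights = [0.8, -0.3, 0.5]

no_bias = weighted_sum(zero_inputs, weights)
with_bias = weighted_sum(zero_inputs, weights, bias=0.25)

print(no_bias)    # 0.0, whatever the weights are
print(with_bias)  # 0.25, the bias shifts the result away from 0
```

Without the bias, no choice of weights can move the summation away from 0 for this input; with it, the neuron still has a knob to turn.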
So we have our summation. We’re not finished yet, though. The neuron needs the potential to either fire or not. This is achieved through an activation function. There are several activation functions out there, and they all have their unique flavors. Exploring each one in depth is beyond the scope of this article, but we’ll take a look at a step function and a sigmoid function to show the contrast between the two.
A step function is like an on/off switch. It’s binary. If you are familiar with programming, you can think of a step function as a boolean conditional. Take this pseudocode example:
threshold = 0.5
If Y > threshold, output 1.
Else (Y <= threshold), output 0.
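The pseudocode above translates almost directly into Python. In this sketch, a summation that lands exactly on the threshold counts as "off"; that tie-breaking choice is an assumption, as is the function name step.

```python
def step(y, threshold=0.5):
    """Return 1 if the summation clears the threshold, otherwise 0."""
    return 1 if y > threshold else 0


print([step(y) for y in (0.2, 0.5, 0.9)])  # [0, 0, 1]
```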
In terms of neural networks, step functions work fine if the network can clearly identify a class. Unfortunately, image identification almost always involves more than one class. For example, digits ranging from 0-9 are the classes used when trying to train a neural network with the dataset of handwritten digits found in the MNIST database. What if all of the neurons fired because all the summations met the threshold? Every output would be a 1, with no way to tell which digit the network favors. The results would be disastrous.
The solution is to use a smooth, nonlinear activation function, like a sigmoid function. When you pass your summation output into the sigmoid function, your output will be a value between 0 and 1 rather than a hard 0 or 1, so you can compare how strongly each neuron fired.
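For reference, the standard sigmoid is sigmoid(y) = 1 / (1 + e^(-y)). A minimal Python sketch follows; the two-branch form is just one way to avoid overflow for large negative inputs, not something the discussion above prescribes.

```python
import math


def sigmoid(y):
    """Squash any real number into the range 0 to 1."""
    if y >= 0:
        return 1.0 / (1.0 + math.exp(-y))
    # For negative y, use an equivalent form so exp() never overflows.
    e = math.exp(y)
    return e / (1.0 + e)


for y in (-4.0, 0.0, 4.0):
    print(y, round(sigmoid(y), 3))
# -4.0 0.018
# 0.0 0.5
# 4.0 0.982
```

Unlike the step function, two summations that both clear the threshold now produce different outputs, so the strongest one can be picked out.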
Now that we have decided on our activation function, we can move on to the most advertised portion of machine learning: the training stage.
When you were a toddler, you probably made plenty of mistakes, causing your parents to rebuke you to show you how wrong you were. After a few more scoldings you were trained to see the error of your ways. Similarly, we train neural networks by determining the amount of error in a prediction through the use of a cost function. The mean square error is a common cost function used for this purpose.
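As a small sketch, here is one common definition of the mean square error: the average of the squared differences between desired and predicted values. Some texts include an extra factor of 1/2; that variant is not used here, and the function name is illustrative.

```python
def mean_squared_error(desired, predicted):
    """Average of the squared differences between desired and predicted values."""
    if len(desired) != len(predicted):
        raise ValueError("desired and predicted must have the same length")
    return sum((d - p) ** 2 for d, p in zip(desired, predicted)) / len(desired)


# Two predictions that are each off by 0.5.
print(mean_squared_error([1.0, 0.0], [0.5, 0.5]))  # 0.25
```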
The desired value is compared to the prediction. Once we obtain the error value, we need to figure out how to minimize the cost function. Theoretically, all we need to do is adjust the weights in order to change the cost. The lower our cost, the more accurate our result will be. Remember our shower analogy? Well, think of all the weights along the synapses as individual knobs. We can adjust these knobs manually by computing every possibility, but now imagine that there are not two knobs, but millions of knobs to adjust. The curse of dimensionality prevents anyone from trying to guess values via a brute force method.
Tired of analogies? Well, here’s a parable of our solution. It’s of a blind man who overshoots the location of his camp. The distance away from the camp is his error. He knows that his camp rests on the lowest elevation. He decides to minimize his error by moving if he senses that he’s going downhill. If he is in fact moving downhill, he will continue to increase his momentum, confident that he’s nearing the bottom.
This is essentially how gradient descent works. What’s happening is a mathematical process called differentiation. We find the derivative of the mean square error with respect to each weight. This determines the rate of change of the cost as that weight changes. If the derivative is negative, increasing the weight sends the cost downhill. If it is positive, increasing the weight sends the cost uphill. We then nudge each weight in whichever direction lowers the cost.
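To make the downhill walk concrete, here is a minimal sketch of gradient descent on a single weight. The setup is an assumption chosen for simplicity: one input, no bias, no activation function, and a squared-error cost. The learning rate of 0.1 is likewise illustrative.

```python
def cost(w, x, target):
    """Squared error of a one-weight 'network' whose prediction is w * x."""
    return (w * x - target) ** 2


def cost_derivative(w, x, target):
    """Derivative of the cost with respect to w (chain rule on the square)."""
    return 2.0 * (w * x - target) * x


x, target = 2.0, 1.0   # the ideal weight is therefore 0.5
w = -1.0               # deliberately bad starting guess
learning_rate = 0.1

for _ in range(25):
    # Step against the derivative: a negative slope pushes w up, a positive one pushes it down.
    w -= learning_rate * cost_derivative(w, x, target)

print(round(w, 4))  # 0.5
```

The size of each step shrinks as the slope flattens, which is why the weight settles near the bottom instead of overshooting it.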
The process of working out these weight adjustments is called backpropagation. You can boil down backpropagation to the chain rule. In calculus, the chain rule says that the derivative of a nested function is the derivative of the outer function multiplied by the derivative of the inner function. With this rule, no matter how many hidden layers a network has, you will always be able to work your way from the nth layer to the first. Let’s reimagine a 3X2X1 network as nested functions f(f(XW^(1)) W^(2)).
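The nested form f(f(XW^(1)) W^(2)) can be differentiated layer by layer. The sketch below does this for a single input row, assuming f is the sigmoid, the cost is the squared error, and there are no biases (to match the nested expression). The sample weights and the names forward and backward are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W1, W2):
    """Compute y_hat = f(f(x W1) W2) for a 3X2X1 network."""
    hidden = [sigmoid(sum(x[i] * W1[i][j] for i in range(3))) for j in range(2)]
    y_hat = sigmoid(sum(hidden[j] * W2[j] for j in range(2)))
    return hidden, y_hat


def backward(x, W1, W2, target):
    """Gradients of (y_hat - target)**2 with respect to W1 and W2."""
    hidden, y_hat = forward(x, W1, W2)
    # Outermost function first: squared error, then the output sigmoid.
    delta_out = 2.0 * (y_hat - target) * y_hat * (1.0 - y_hat)
    grad_W2 = [delta_out * hidden[j] for j in range(2)]
    # One layer further in: multiply by W2 and the hidden sigmoid's derivative.
    delta_hidden = [delta_out * W2[j] * hidden[j] * (1.0 - hidden[j]) for j in range(2)]
    grad_W1 = [[x[i] * delta_hidden[j] for j in range(2)] for i in range(3)]
    return grad_W1, grad_W2


x = [1.0, 0.5, -1.0]
W1 = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]
W2 = [0.7, -0.6]
target = 1.0
grad_W1, grad_W2 = backward(x, W1, W2, target)

# Sanity check: compare one gradient entry against a finite difference.
eps = 1e-6
cost_up = (forward(x, W1, [W2[0] + eps, W2[1]])[1] - target) ** 2
cost_down = (forward(x, W1, [W2[0] - eps, W2[1]])[1] - target) ** 2
numeric = (cost_up - cost_down) / (2 * eps)
print(abs(numeric - grad_W2[0]) < 1e-6)  # True
```

Each delta reuses the one computed for the layer after it, which is why the work scales with the number of layers rather than exploding.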
You can equate backpropagation to taking apart a nested matryoshka doll. We’re popping out one function at a time, multiplying the derivative of the outer by the derivative of the inner, until we reach our root layer. On the other hand, forward propagation is like putting together a matryoshka doll. When you combine forward propagation and backpropagation, you create a loop of incremental weight adjustments that steadily decrease the error. After many iterations, the weights stabilize and the end result is an optimized output.
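The loop of forward propagation, backpropagation, and weight adjustment can be sketched with a single sigmoid neuron. Everything specific here is an assumption made for illustration: the toy data (the logical OR of two inputs), the starting weights, the learning rate of 0.5, and the 2000 iterations.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Toy data (illustrative): inputs paired with the desired output, logical OR.
samples = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]

weights = [0.1, -0.1]
bias = 0.0
learning_rate = 0.5
costs = []

for epoch in range(2000):
    grad_w = [0.0, 0.0]
    grad_b = 0.0
    total = 0.0
    for inputs, desired in samples:
        # Forward propagation: summation plus bias, then the activation.
        prediction = sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
        error = prediction - desired
        total += error ** 2
        # Backpropagation: chain rule through the squared error and the sigmoid.
        delta = 2.0 * error * prediction * (1.0 - prediction)
        grad_w = [g + delta * x for g, x in zip(grad_w, inputs)]
        grad_b += delta
    n = len(samples)
    costs.append(total / n)  # mean square error for this pass
    # Gradient descent: nudge every knob against its averaged derivative.
    weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
    bias -= learning_rate * grad_b / n

print(costs[0] > costs[-1])  # True: the error shrinks over the iterations
```

Plotting costs against the iteration number would show the downhill walk from the parable: steep at first, then flattening out as the weights settle.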
In the beginning, I mentioned that neural networks may result in artistically inclined androids. Yes, this may be a fantasy. Some data scientists will be ecstatic if their network’s ImageNet classification accuracy can climb above 80 percent. This still shouldn’t discourage the most ardent Asimov fan, however, because data classification is at the forefront of AI nowadays. You know all those data science buzzwords you keep seeing in tech articles, and how deep learning seems to show up 9 out of 10 times? Well, recent breakthroughs in data classification are to blame for all the recent hoopla.
The primary type of neural network involved in these classifications is called a Convolutional Neural Network, CNN, or ConvNet, whichever floats your boat. Neural networks have many flavors depending on the problem at hand. They continue to break barriers and will be a prominent part of the technological landscape in the years to come.