First, I'm gonna tell you another secret: there's no magic, only math 😵
This article is based on my previous one. If you haven't read it yet, now's the time! I will use the same formulas and try to explain them. Let's go!
I'm gonna solve XOR again 😅 It's not a joke, bro! Many data science books start by solving it 😎 Once more, here's the XOR table as a reminder.
x1  x2  |  XOR
0   0   |  0
0   1   |  1
1   0   |  1
1   1   |  0
To demonstrate, let's use the following neural network structure.
Here we have 2 neurons in the input layer, 4 in the hidden layer and 1 in the output layer.
The main goal of neural network training is to adjust the weights so that the output error is minimized. In most cases the weights are initialized randomly, and during training they are adjusted by the backpropagation algorithm.
So, let's initialize the weights randomly from the [0, 1] range.
Graphically, it looks like this.
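Here's a minimal sketch of that random initialization, bro. The helper name `init_weights` and the fixed seed are my own assumptions, not something from the article, and the walkthrough below uses hand-picked weights instead of these random ones.

```python
import random

def init_weights(n_in, n_out, rng):
    """Return n_out rows of n_in weights, each drawn uniformly from [0, 1)."""
    return [[rng.random() for _ in range(n_in)] for _ in range(n_out)]

# Fixed seed is an assumption, only here so repeated runs give the same weights.
rng = random.Random(42)

input_to_hidden = init_weights(2, 4, rng)   # 2 input neurons -> 4 hidden neurons
hidden_to_output = init_weights(4, 1, rng)  # 4 hidden neurons -> 1 output neuron

for row in input_to_hidden:
    print([round(w, 2) for w in row])
```

Each row holds the weights that feed one neuron, so `input_to_hidden[0]` is everything going into the first hidden neuron.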
Ok, let's compute the neuron inputs. I will use only one input case to save time: 0 1, so the expected output will be 1.
So, for the first neuron in the hidden layer:
net1_h = 0 * 0.2 + 1 * 0.6 = 0.6
/**
  1..n, n = 2 (2 neurons in the input layer)
  0    value of the first input element
  1    value of the second input element
  0.2  the weight from first input neuron to first hidden
  0.6  the weight from second input neuron to first hidden
  Understand, bro? 😏
*/
For the second one and the others:
net2_h = 0 * 0.5 + 1 * 0.7 = 0.7
net3_h = 0 * 0.4 + 1 * 0.9 = 0.9
net4_h = 0 * 0.8 + 1 * 0.3 = 0.3
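The same weighted sums can be sketched in a few lines of Python. The weights are the ones from the calculation above; the function name `net_input` is just an illustrative choice.

```python
inputs = [0, 1]  # the single input case used in this walkthrough

# One row per hidden neuron: [weight from input 1, weight from input 2]
input_to_hidden = [
    [0.2, 0.6],
    [0.5, 0.7],
    [0.4, 0.9],
    [0.8, 0.3],
]

def net_input(inputs, weights):
    """Weighted sum of the inputs for a single neuron."""
    return sum(x * w for x, w in zip(inputs, weights))

net_h = [net_input(inputs, row) for row in input_to_hidden]
print(net_h)  # [0.6, 0.7, 0.9, 0.3]
```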
We'll use the sigmoid as the activation function f, along with its derivative:
f(x) = 1 / (1 + exp(-x))
deriv(x) = f(x) * (1 - f(x))
So, now we apply our activation to each of the computed net inputs (values are truncated to two decimals):
output1_h = f(net1_h) = f(0.6) = 0.64
output2_h = f(net2_h) = f(0.7) = 0.66
output3_h = f(net3_h) = f(0.9) = 0.71
output4_h = f(net4_h) = f(0.3) = 0.57
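If you'd rather let the machine do it, here's a small sketch of the sigmoid and its derivative applied to the four net inputs. The function names are my own. Python's `round` gives 0.65 and 0.67 for the first two neurons, so expect tiny differences from hand-computed numbers that were cut off after two decimals.

```python
import math

def sigmoid(x):
    """f(x) = 1 / (1 + exp(-x))"""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_deriv(x):
    """Derivative of the sigmoid at x: f(x) * (1 - f(x))"""
    s = sigmoid(x)
    return s * (1.0 - s)

net_h = [0.6, 0.7, 0.9, 0.3]
output_h = [sigmoid(n) for n in net_h]
print([round(o, 2) for o in output_h])
```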
We've got the output values for each neuron in the hidden layer. Graphically, it looks like this:
Now that we have the output values of the hidden layer neurons, we can calculate the output value of the output layer.
net_o = 0.64 * 0.6 + 0.66 * 0.7 + 0.71 * 0.3 + 0.57 * 0.4 = 1.28
output_o = f(net_o) = f(1.28) = 0.78
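Putting the whole forward pass together could look roughly like this. It's a sketch with names I made up (`forward`, `input_to_hidden`, `hidden_to_output`), and it keeps full floating-point precision, so the output comes out at about 0.79 rather than the 0.78 you get from two-decimal intermediate values.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, input_to_hidden, hidden_to_output):
    """Return (hidden outputs, network output) for one input case."""
    output_h = [
        sigmoid(sum(x * w for x, w in zip(inputs, row)))
        for row in input_to_hidden
    ]
    net_o = sum(h * w for h, w in zip(output_h, hidden_to_output))
    return output_h, sigmoid(net_o)

input_to_hidden = [[0.2, 0.6], [0.5, 0.7], [0.4, 0.9], [0.8, 0.3]]
hidden_to_output = [0.6, 0.7, 0.3, 0.4]

output_h, output_o = forward([0, 1], input_to_hidden, hidden_to_output)
print(round(output_o, 4))
```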
And here we go.
Bro, look at the output value. What do you see? 0.78, right? If you remember the XOR table, you know that we should have got 1 for the case 0 1, but we've got 0.78. The gap between the two is called the error. Let's calculate it.
target = 1
error = target - output_o = 1 - 0.78 = 0.22
Now, we need to calculate the delta error. In general, that's the value by which you adjust the weights.
You can use this site for sigmoid derivative calculation.
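The derivative is taken at the neuron's net input. Since output_o = f(net_o), deriv(net_o) is simply output_o * (1 - output_o):
delta_error = deriv(net_o) * error = output_o * (1 - output_o) * error = 0.78 * 0.22 * 0.22 = 0.17 * 0.22 = 0.04 (rounded)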
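A quick sketch of the error and the output delta in code. One assumption to flag: because `output_o` is already f(net_o), the helper below computes the sigmoid derivative straight from the output as `out * (1 - out)` instead of calling the sigmoid a second time. The helper name is hypothetical.

```python
def sigmoid_deriv_from_output(out):
    """Sigmoid derivative expressed through the already-activated output.

    If out = f(net), then f'(net) = out * (1 - out).
    """
    return out * (1.0 - out)

output_o = 0.78  # network output from the forward pass (two decimals)
target = 1       # expected XOR result for the input 0 1

error = target - output_o
delta_error = sigmoid_deriv_from_output(output_o) * error
print(round(error, 2), round(delta_error, 2))
```

At two decimals the delta comes out as 0.04.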
Let's do the same for each neuron in the hidden layer. The formula is a little bit different.
We need to calculate the error for each neuron. Remember it, bro. Let's get started!
error1_h = delta_error * 0.6 = 0.04 * 0.6 = 0.024
error2_h = delta_error * 0.7 = 0.04 * 0.7 = 0.028
error3_h = delta_error * 0.3 = 0.04 * 0.3 = 0.012
error4_h = delta_error * 0.4 = 0.04 * 0.4 = 0.016
And again the delta!
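The derivative is taken at each neuron's net input, and since outputN_h = f(netN_h), deriv(netN_h) = outputN_h * (1 - outputN_h). Values are truncated, like the rest of the numbers in this walkthrough.
delta_error1_h = deriv(net1_h) * error1_h = 0.64 * (1 - 0.64) * 0.024 = 0.23 * 0.024 = 0.005
delta_error2_h = deriv(net2_h) * error2_h = 0.66 * (1 - 0.66) * 0.028 = 0.22 * 0.028 = 0.006
delta_error3_h = deriv(net3_h) * error3_h = 0.71 * (1 - 0.71) * 0.012 = 0.20 * 0.012 = 0.002
delta_error4_h = deriv(net4_h) * error4_h = 0.57 * (1 - 0.57) * 0.016 = 0.24 * 0.016 = 0.003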
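Both hidden-layer steps (the error, then the delta) fit into one small function. This is a sketch under my own naming (`hidden_deltas`), and it computes the sigmoid derivative from the activated output as `out * (1 - out)`, which is the same thing as f(net) * (1 - f(net)). It keeps full precision, so the results can differ a bit in the third decimal from a hand calculation.

```python
def hidden_deltas(output_h, hidden_to_output, delta_error):
    """Delta for each hidden neuron, given the output neuron's delta."""
    deltas = []
    for out, weight in zip(output_h, hidden_to_output):
        error_h = delta_error * weight             # error passed back through the weight
        deltas.append(out * (1.0 - out) * error_h)  # scaled by the sigmoid derivative
    return deltas

output_h = [0.64, 0.66, 0.71, 0.57]
hidden_to_output = [0.6, 0.7, 0.3, 0.4]

deltas = hidden_deltas(output_h, hidden_to_output, delta_error=0.04)
print([round(d, 4) for d in deltas])
```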
Now, we have all variables to update the weights. The formulas look like this.
Let's start from the hidden to the output.
learning_rate = 0.001
hidden_to_output_1 = old_weight + output1_h * delta_error * learning_rate = 0.6 + 0.64 * 0.04 * 0.001 = 0.6000256
hidden_to_output_2 = old_weight + output2_h * delta_error * learning_rate = 0.7 + 0.66 * 0.04 * 0.001 = 0.7000264
hidden_to_output_3 = old_weight + output3_h * delta_error * learning_rate = 0.3 + 0.71 * 0.04 * 0.001 = 0.3000284
hidden_to_output_4 = old_weight + output4_h * delta_error * learning_rate = 0.4 + 0.57 * 0.04 * 0.001 = 0.4000228
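The update rule above is the same line repeated four times, so it's a natural fit for a tiny helper. The name `update_weights` is an assumption of mine; the numbers are the ones from the calculation.

```python
def update_weights(weights, inputs, delta, learning_rate):
    """new_weight = old_weight + input * delta * learning_rate, per weight."""
    return [w + x * delta * learning_rate for w, x in zip(weights, inputs)]

hidden_to_output = [0.6, 0.7, 0.3, 0.4]
output_h = [0.64, 0.66, 0.71, 0.57]  # the "inputs" to the output neuron

new_hidden_to_output = update_weights(
    hidden_to_output, output_h, delta=0.04, learning_rate=0.001
)
print([round(w, 7) for w in new_hidden_to_output])
```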
We've got values very close to the old weights. That's because we chose a very small learning rate. It's a very important hyperparameter. If you choose it too small, your network will be training for years 😄 If it's too large, your network will train faster, but the updates can overshoot and its accuracy may be low for new data. So you have to choose it carefully. The optimal value is in range between
Ok, let's do the same for the input to the hidden synapses.
//for the first hidden neuron
input_to_hidden_1 = old_weight + input_0 * delta_error1_h * learning_rate = 0.2 + 0 * 0.005 * 0.001 = 0.2
input_to_hidden_2 = old_weight + input_1 * delta_error1_h * learning_rate = 0.6 + 1 * 0.005 * 0.001 = 0.600005
//for the second one
input_to_hidden_3 = old_weight + input_0 * delta_error2_h * learning_rate = 0.5 + 0 * 0.006 * 0.001 = 0.5
input_to_hidden_4 = old_weight + input_1 * delta_error2_h * learning_rate = 0.7 + 1 * 0.006 * 0.001 = 0.700006
//for the third one
input_to_hidden_5 = old_weight + input_0 * delta_error3_h * learning_rate = 0.4 + 0 * 0.002 * 0.001 = 0.4
input_to_hidden_6 = old_weight + input_1 * delta_error3_h * learning_rate = 0.9 + 1 * 0.002 * 0.001 = 0.900002
//for the fourth one
input_to_hidden_7 = old_weight + input_0 * delta_error4_h * learning_rate = 0.8 + 0 * 0.003 * 0.001 = 0.8
input_to_hidden_8 = old_weight + input_1 * delta_error4_h * learning_rate = 0.3 + 1 * 0.003 * 0.001 = 0.300003
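Those eight updates follow one pattern: every weight going into a hidden neuron moves by input * that neuron's delta * learning rate. A sketch with a hypothetical helper `update_layer`:

```python
def update_layer(weights, inputs, deltas, learning_rate):
    """Update a whole layer: one row of weights and one delta per neuron."""
    return [
        [w + x * delta * learning_rate for w, x in zip(row, inputs)]
        for row, delta in zip(weights, deltas)
    ]

input_to_hidden = [[0.2, 0.6], [0.5, 0.7], [0.4, 0.9], [0.8, 0.3]]
hidden_deltas = [0.005, 0.006, 0.002, 0.003]

new_input_to_hidden = update_layer(
    input_to_hidden, inputs=[0, 1], deltas=hidden_deltas, learning_rate=0.001
)
for row in new_input_to_hidden:
    print([round(w, 6) for w in row])
```

Notice that the weights coming from the first input don't move at all: that input is 0, so its contribution to the update is 0 too.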
That's it! Finally 😉
Oh, finally we did all the math stuff! But we only did that for one training set: 0 1. The XOR problem we're solving has 4 training sets (see the table above). That means you have to do the same calculations we just did for each training set! Brrr, that's terrible 😑 Too much math 😆
So, in machine learning, one forward propagation step (from the input layer to the output) plus one backward step (from the output layer to the input) for one training set is called an iteration. Another important term is epoch. The epoch counter goes up by one each time all the training sets have passed through your neural network. In our case we have 4 training sets. One iteration means one training set has passed through the network, and when all of them have passed through, that's one epoch. So 4 iterations equal 1 epoch. Understand, bro? 🤗 In general, more epochs mean higher accuracy and fewer epochs mean lower accuracy, at least on the training data.
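To make iterations and epochs concrete, here's a rough sketch of a training loop over the 4 XOR training sets. Everything about its shape is my assumption (the `train` function, the 10 epochs, computing the sigmoid derivative as `out * (1 - out)`); the starting weights and the learning rate are taken from the walkthrough. It only shows how the counting works and makes no promise about how well the network ends up solving XOR.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

XOR = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]

def train(w_ih, w_ho, epochs, learning_rate):
    """Run backpropagation over XOR; return (w_ih, w_ho, iterations)."""
    iterations = 0
    for _ in range(epochs):                # one epoch = all 4 training sets
        for inputs, target in XOR:         # one iteration = one training set
            # forward step
            out_h = [sigmoid(sum(x * w for x, w in zip(inputs, row))) for row in w_ih]
            out_o = sigmoid(sum(h * w for h, w in zip(out_h, w_ho)))
            # backward step (hidden deltas use the weights from before the update)
            delta_o = out_o * (1 - out_o) * (target - out_o)
            delta_h = [h * (1 - h) * delta_o * w for h, w in zip(out_h, w_ho)]
            # weight updates
            w_ho = [w + h * delta_o * learning_rate for w, h in zip(w_ho, out_h)]
            w_ih = [[w + x * d * learning_rate for w, x in zip(row, inputs)]
                    for row, d in zip(w_ih, delta_h)]
            iterations += 1
    return w_ih, w_ho, iterations

w_ih = [[0.2, 0.6], [0.5, 0.7], [0.4, 0.9], [0.8, 0.3]]
w_ho = [0.6, 0.7, 0.3, 0.4]
w_ih, w_ho, iterations = train(w_ih, w_ho, epochs=10, learning_rate=0.001)
print(iterations)  # 40
```

Ten epochs times 4 training sets gives 40 iterations, bro.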
That's it. No magic, only math. Hope you've understood it, bro 😊 See ya! Happy coding 😇