Daniel Cooper

Posted on Feb 4, 2023 • Edited on Mar 31, 2023

Neural networks 101: The basics of forward and backward propagation

#machinelearning #neuralnetworks #beginners

This post shows you how to construct the forward propagation and backpropagation algorithms for a simple neural network. It may be helpful to you if you've just started to learn about neural networks and want to check your logic regarding the maths.

Consider the neural network below. It has two inputs (x₁ and x₂), two hidden neurons (h₁ and h₂), one output neuron (o₁) and a series of weights and biases.

We want to use this neural network to predict outputs and have four training examples which show the expected output y for different combinations of x₁ and x₂.

x₁	x₂	y
0	0	0
0	1	1
1	0	1
1	1	1

Our neural network starts with a random set of weights between 0 and 1, and biases set to 0. In this initial configuration, as you might expect, our neural network will predict outputs which bear no relation to the expected outputs.

To improve the prediction accuracy, we need to optimise the weights and biases so that they work together to predict outputs that are as close as possible to the expected outputs.

This optimisation process is called training and it involves two stages: forward propagation and backpropagation.

Forward propagation

The forward pass stage moves through the neural network from left to right.

Each neuron is configured to sum its weighted inputs with its bias and to then pass that value z through an activation function $\sigma$ to produce an output a.

We can express the sum z and output a for each neuron as:

\begin{align*} z_{h1} &= w_1x_1 + w_3x_2 + b_1 \newline a_{h1} &= \sigma(z_{h1}) \newline\newline z_{h2} &= w_2x_1 + w_4x_2 + b_2 \newline a_{h2} &= \sigma(z_{h2}) \newline\newline z_{o1} &= w_5a_{h1} + w_6a_{h2} + b_3 \newline a_{o1} &= \sigma(z_{o1}) \end{align*}

The activation function $\sigma$ is used to transform a continuous input into an output between 0 and 1, with 0 for large negative inputs and 1 for large positive inputs.

Here we'll apply the commonly-used Sigmoid function for activation:

\sigma(z) = \frac{\mathrm{1} }{\mathrm{1} + e^{-z}}

With these equations, we can use the x₁ and x₂ from each of the training examples to predict an output (a_o1).

To measure the prediction accuracy of our neural network across all of the training examples, we can the Mean Squared Error function to find the total loss J:

\displaystyle \text{Total loss, J} = \frac{1}{2m}\sum_{i=1}^{m}(a_{o1} - y)^2

where m is the total number of training examples; in our case, m = 4.

To begin, the total loss J is almost certain to be greater than zero unless we have been particularly fortunate with our random weights and biases.

More likely we will need to optimise the weights and biases and to do that, we use backpropagation.

Backpropagation

Backpropagation moves backwards through the neural network, from right to left.

Using equation from above for the total loss J, if we consider just a single training example with m = 1, we can define a loss L:

\displaystyle \text{Loss, L} = \frac{1}{2}(a_{o1} - y)^2

The loss L is dependent on a_o1 which itself is dependent on all of the weights and biases. It follows that optimising these weights and biases will reduce the loss L.

Using an iterative process, we can optimise each weight and bias using the principle of gradient descent and a suitable learning rate $\alpha$ :

\begin{align*} w_s &= w_s - \alpha\dfrac{\mathrm{d}L}{\mathrm{d}w_s} \text{ where s = 1, 2, ..., 6} \newline\newline b_t &= b_t - \alpha\dfrac{\mathrm{d}L}{\mathrm{d}b_t} \text{ where t = 1, 2, 3} \end{align*}

To use these equations, we first need expressions for the derivative of the loss L with respect to each weight and bias. These expressions are partial derivatives which we can derive using the chain rule.

Before we start, there are two derivatives that we will come in handy. The first is the derivative of the Sigmoid activation function:

\begin{align*} \sigma(z) &= \frac{\mathrm{1} }{\mathrm{1} + e^{-z}} \newline\newline \dfrac{\mathrm{d}\sigma(z)}{\mathrm{d}z} &= \sigma(z)(1-\sigma(z)) \end{align*}

The second is the derivative of the loss L with respect to a_o1:

\begin{align*} \displaystyle L &= \frac{1}{2}(a_{o1} - y)^2 \newline\newline \dfrac{\mathrm{d}L}{\mathrm{d}a_{o1}} &= (a_{o1} - y) \end{align*}

Let's now start with weight w₆ and find the derivative of loss L with respect to w₆:

\dfrac{\mathrm{d}L}{\mathrm{d}w_6}

Using the chain rule, we can say that:

\dfrac{\mathrm{d}L}{\mathrm{d}w_6} = \dfrac{\mathrm{d}L}{\mathrm{d}a_{o1}}.\dfrac{\mathrm{d}a_{o1}}{\mathrm{d}w_6}

Therefore:

\dfrac{\mathrm{d}L}{\mathrm{d}w_6} = (a_{o1} - y).\dfrac{\mathrm{d}a_{o1}}{\mathrm{d}w_6}

This is progress but we need to go further with:

\dfrac{\mathrm{d}a_{o1}}{\mathrm{d}w_6}

Using the chain rule again, we can say:

\dfrac{\mathrm{d}a_{o1}}{\mathrm{d}w_6} = \dfrac{\mathrm{d}a_{o1}}{\mathrm{d}z_{o1}}.\dfrac{\mathrm{d}z_{o1}}{\mathrm{d}w_6}

where:

\dfrac{\mathrm{d}a_{o1}}{\mathrm{d}z_{o1}} = \dfrac{\mathrm{d}\sigma(z_{o1})}{\mathrm{d}z_{o1}} = \sigma(z_{o1})(1-\sigma(z_{o1}))

and:

\dfrac{\mathrm{d}z_{o1}}{\mathrm{d}w_6} = \dfrac{\mathrm{d}(w_5a_{h1} + w_6a_{h2} + b_3)}{\mathrm{d}w_6} = a_{h2} = \sigma(z_{h2})

Combining these partial derivatives, we get an expression for w₆ that can be computed:

\begin{align*} \dfrac{\mathrm{d}L}{\mathrm{d}w_6} &= \dfrac{\mathrm{d}L}{\mathrm{d}a_{o1}}.\dfrac{\mathrm{d}a_{o1}}{\mathrm{d}w_6} \newline\newline &= \dfrac{\mathrm{d}L}{\mathrm{d}a_{o1}}.\dfrac{\mathrm{d}a_{o1}}{\mathrm{d}z_{o1}}.\dfrac{\mathrm{d}z_{o1}}{\mathrm{d}w_6} \newline\newline &= (a_{o1} - y).\sigma(z_{o1})(1-\sigma(z_{o1})).\sigma(z_{h2}) \end{align*}

Using the same approach for w₅ gives:

\begin{align*} \dfrac{\mathrm{d}L}{\mathrm{d}w_5} &= \dfrac{\mathrm{d}L}{\mathrm{d}a_{o1}}.\dfrac{\mathrm{d}a_{o1}}{\mathrm{d}w_5} \newline\newline &= \dfrac{\mathrm{d}L}{\mathrm{d}a_{o1}}.\dfrac{\mathrm{d}a_{o1}}{\mathrm{d}z_{o1}}.\dfrac{\mathrm{d}z_{o1}}{\mathrm{d}w_5} \newline\newline &= (a_{o1} - y).\sigma(z_{o1})(1-\sigma(z_{o1})).\sigma(z_{h1}) \end{align*}

For b₃:

\begin{align*} \dfrac{\mathrm{d}L}{\mathrm{d}b_3} &= \dfrac{\mathrm{d}L}{\mathrm{d}a_{o1}}.\dfrac{\mathrm{d}a_{o1}}{\mathrm{d}b_3} \newline\newline &= \dfrac{\mathrm{d}L}{\mathrm{d}a_{o1}}.\dfrac{\mathrm{d}a_{o1}}{\mathrm{d}z_{o1}}.\dfrac{\mathrm{d}z_{o1}}{\mathrm{d}b_3} \newline\newline &= (a_{o1} - y).\sigma(z_{o1})(1-\sigma(z_{o1})) \end{align*}

which takes a shorter form because:

\dfrac{\mathrm{d}z_{o1}}{\mathrm{d}b_3} = \dfrac{\mathrm{d}(w_5a_{h1} + w_6a_{h2} + b_3)}{\mathrm{d}b_3} = 1

Next, we need to repeat the exercise with weights w₄, w₃, w₂, w₁ and biases b₂ and b₁.

Let's start with weight w₄ and find the derivative of loss L with respect to w₄:

\dfrac{\mathrm{d}L}{\mathrm{d}w_4}

Using the chain rule, we can say that:

\dfrac{\mathrm{d}L}{\mathrm{d}w_4} = \dfrac{\mathrm{d}L}{\mathrm{d}a_{o1}}.\dfrac{\mathrm{d}a_{o1}}{\mathrm{d}w_4}

Therefore:

\dfrac{\mathrm{d}L}{\mathrm{d}w_4} = (a_{o1} - y).\dfrac{\mathrm{d}a_{o1}}{\mathrm{d}w_4}

Once again, this is progress but we need to go further with:

\dfrac{\mathrm{d}a_{o1}}{\mathrm{d}w_4}

Using the chain rule again, we can say:

\dfrac{\mathrm{d}a_{o1}}{\mathrm{d}w_4} = \dfrac{\mathrm{d}a_{o1}}{\mathrm{d}a_{h2}}.\dfrac{\mathrm{d}a_{h2}}{\mathrm{d}w_4}

where:

\begin{align*} \dfrac{\mathrm{d}a_{o1}}{\mathrm{d}a_{h2}} &= \dfrac{\mathrm{d}a_{o1}}{\mathrm{d}z_{o1}}.\dfrac{\mathrm{d}z_{o1}}{\mathrm{d}a_{h2}} \newline\newline &= \dfrac{\mathrm{d}\sigma(z_{o1})}{\mathrm{d}z_{o1}}.\dfrac{\mathrm{d}(w_5a_{h1} + w_6a_{h2} + b_3)}{\mathrm{d}a_{h2}} \newline\newline &= \sigma(z_{o1})(1-\sigma(z_{o1})).w_6 \end{align*}

and:

\begin{align*} \dfrac{\mathrm{d}a_{h2}}{\mathrm{d}w_4} &= \dfrac{\mathrm{d}a_{h2}}{\mathrm{d}z_{h2}}.\dfrac{\mathrm{d}z_{h2}}{\mathrm{d}w_4} \newline\newline &= \dfrac{\mathrm{d}\sigma(z_{h2})}{\mathrm{d}z_{h2}}.\dfrac{\mathrm{d}(w_2x_1 + w_4x_2 + b_2)}{\mathrm{d}w_4} \newline\newline &= \sigma(z_{h2})(1-\sigma(z_{h2})).x_2 \end{align*}

Finally for w₄:

\begin{align*} \dfrac{\mathrm{d}L}{\mathrm{d}w_4} &= \dfrac{\mathrm{d}L}{\mathrm{d}a_{o1}}.\dfrac{\mathrm{d}a_{o1}}{\mathrm{d}w_4} \newline\newline &= \dfrac{\mathrm{d}L}{\mathrm{d}a_{o1}}.\dfrac{\mathrm{d}a_{o1}}{\mathrm{d}a_{h2}}.\dfrac{\mathrm{d}a_{h2}}{\mathrm{d}w_4} \newline\newline &= \dfrac{\mathrm{d}L}{\mathrm{d}a_{o1}}.\dfrac{\mathrm{d}a_{o1}}{\mathrm{d}z_{o1}}.\dfrac{\mathrm{d}z_{o1}}{\mathrm{d}a_{h2}}.\dfrac{\mathrm{d}a_{h2}}{\mathrm{d}z_{h2}}.\dfrac{\mathrm{d}z_{h2}}{\mathrm{d}w_4} \newline\newline &= (a_{o1} - y).\sigma(z_{o1})(1-\sigma(z_{o1})).w_6.\sigma(z_{h2})(1-\sigma(z_{h2})).x_2 \end{align*}

Using the same approach for w₃ gives:

\begin{align*} \dfrac{\mathrm{d}L}{\mathrm{d}w_3} &= \dfrac{\mathrm{d}L}{\mathrm{d}a_{o1}}.\dfrac{\mathrm{d}a_{o1}}{\mathrm{d}w_3} \newline\newline &= \dfrac{\mathrm{d}L}{\mathrm{d}a_{o1}}.\dfrac{\mathrm{d}a_{o1}}{\mathrm{d}a_{h1}}.\dfrac{\mathrm{d}a_{h1}}{\mathrm{d}w_3} \newline\newline &= \dfrac{\mathrm{d}L}{\mathrm{d}a_{o1}}.\dfrac{\mathrm{d}a_{o1}}{\mathrm{d}z_{o1}}.\dfrac{\mathrm{d}z_{o1}}{\mathrm{d}a_{h1}}.\dfrac{\mathrm{d}a_{h1}}{\mathrm{d}z_{h1}}.\dfrac{\mathrm{d}z_{h1}}{\mathrm{d}w_3} \newline\newline &= (a_{o1} - y).\sigma(z_{o1})(1-\sigma(z_{o1})).w_5.\sigma(z_{h1})(1-\sigma(z_{h1})).x_2 \end{align*}

For w₂:

\begin{align*} \dfrac{\mathrm{d}L}{\mathrm{d}w_2} &= \dfrac{\mathrm{d}L}{\mathrm{d}a_{o1}}.\dfrac{\mathrm{d}a_{o1}}{\mathrm{d}w_2} \newline\newline &= \dfrac{\mathrm{d}L}{\mathrm{d}a_{o1}}.\dfrac{\mathrm{d}a_{o1}}{\mathrm{d}a_{h2}}.\dfrac{\mathrm{d}a_{h2}}{\mathrm{d}w_2} \newline\newline &= \dfrac{\mathrm{d}L}{\mathrm{d}a_{o1}}.\dfrac{\mathrm{d}a_{o1}}{\mathrm{d}z_{o1}}.\dfrac{\mathrm{d}z_{o1}}{\mathrm{d}a_{h2}}.\dfrac{\mathrm{d}a_{h2}}{\mathrm{d}z_{h2}}.\dfrac{\mathrm{d}z_{h2}}{\mathrm{d}w_2} \newline\newline &= (a_{o1} - y).\sigma(z_{o1})(1-\sigma(z_{o1})).w_6.\sigma(z_{h2})(1-\sigma(z_{h2})).x_1 \end{align*}

For w₁:

\begin{align*} \dfrac{\mathrm{d}L}{\mathrm{d}w_1} &= \dfrac{\mathrm{d}L}{\mathrm{d}a_{o1}}.\dfrac{\mathrm{d}a_{o1}}{\mathrm{d}w_1} \newline\newline &= \dfrac{\mathrm{d}L}{\mathrm{d}a_{o1}}.\dfrac{\mathrm{d}a_{o1}}{\mathrm{d}a_{h1}}.\dfrac{\mathrm{d}a_{h1}}{\mathrm{d}w_1} \newline\newline &= \dfrac{\mathrm{d}L}{\mathrm{d}a_{o1}}.\dfrac{\mathrm{d}a_{o1}}{\mathrm{d}z_{o1}}.\dfrac{\mathrm{d}z_{o1}}{\mathrm{d}a_{h1}}.\dfrac{\mathrm{d}a_{h1}}{\mathrm{d}z_{h1}}.\dfrac{\mathrm{d}z_{h1}}{\mathrm{d}w_1} \newline\newline &= (a_{o1} - y).\sigma(z_{o1})(1-\sigma(z_{o1})).w_5.\sigma(z_{h1})(1-\sigma(z_{h1})).x_1 \end{align*}

For b₂:

\begin{align*} \dfrac{\mathrm{d}L}{\mathrm{d}b_2} &= \dfrac{\mathrm{d}L}{\mathrm{d}a_{o1}}.\dfrac{\mathrm{d}a_{o1}}{\mathrm{d}b_2} \newline\newline &= \dfrac{\mathrm{d}L}{\mathrm{d}a_{o1}}.\dfrac{\mathrm{d}a_{o1}}{\mathrm{d}a_{h2}}.\dfrac{\mathrm{d}a_{h2}}{\mathrm{d}b_2} \newline\newline &= \dfrac{\mathrm{d}L}{\mathrm{d}a_{o1}}.\dfrac{\mathrm{d}a_{o1}}{\mathrm{d}z_{o1}}.\dfrac{\mathrm{d}z_{o1}}{\mathrm{d}a_{h2}}.\dfrac{\mathrm{d}a_{h2}}{\mathrm{d}z_{h2}}.\dfrac{\mathrm{d}z_{h2}}{\mathrm{d}b_2} \newline\newline &= (a_{o1} - y).\sigma(z_{o1})(1-\sigma(z_{o1})).w_6.\sigma(z_{h2})(1-\sigma(z_{h2})) \end{align*}

which takes a shorter form because:

\dfrac{\mathrm{d}z_{h2}}{\mathrm{d}b_2} = \dfrac{\mathrm{d}(w_2x_1 + w_4x_2 + b_2)}{\mathrm{d}b_2} = 1

For b₁:

\begin{align*} \dfrac{\mathrm{d}L}{\mathrm{d}b_1} &= \dfrac{\mathrm{d}L}{\mathrm{d}a_{o1}}.\dfrac{\mathrm{d}a_{o1}}{\mathrm{d}b_1} \newline\newline &= \dfrac{\mathrm{d}L}{\mathrm{d}a_{o1}}.\dfrac{\mathrm{d}a_{o1}}{\mathrm{d}a_{h1}}.\dfrac{\mathrm{d}a_{h1}}{\mathrm{d}b_1} \newline\newline &= \dfrac{\mathrm{d}L}{\mathrm{d}a_{o1}}.\dfrac{\mathrm{d}a_{o1}}{\mathrm{d}z_{o1}}.\dfrac{\mathrm{d}z_{o1}}{\mathrm{d}a_{h1}}.\dfrac{\mathrm{d}a_{h1}}{\mathrm{d}z_{h1}}.\dfrac{\mathrm{d}z_{h1}}{\mathrm{d}b_1} \newline\newline &= (a_{o1} - y).\sigma(z_{o1})(1-\sigma(z_{o1})).w_5.\sigma(z_{h1})(1-\sigma(z_{h1})) \end{align*}

which also takes a shorter form because:

\dfrac{\mathrm{d}z_{h1}}{\mathrm{d}b_1} = \dfrac{\mathrm{d}(w_1x_1 + w_3x_2 + b_1)}{\mathrm{d}b_1} = 1

Training

With expressions for the derivative of the loss L with respect to each weight and bias, we can now train our neural network using the gradient descent equations that we saw earlier:

\begin{align*} w_s &= w_s - \alpha\dfrac{\mathrm{d}L}{\mathrm{d}w_s} \text{ where s = 1, 2, ..., 6} \newline\newline b_t &= b_t - \alpha\dfrac{\mathrm{d}L}{\mathrm{d}b_t} \text{ where t = 1, 2, 3} \end{align*}

Here's the step-by-step process for training:

Step 1. Take the first training example
- Calculate the current outputs from each neuron using the forward propagation equations
- Update the weights and biases using the gradient descent equations
Step 2. Repeat step 1 with each of the remaining training examples
Step 3. Repeat the entire process another 10,000 epochs

After 10,000 epochs, the total loss J of our neural network should reduce, leading to an improvement in prediction accuracy.

In a perfect scenario, plotting the total loss J against the number of iterations performed will reveal something like this:

With this sort of outcome, our trained neural network should do pretty well at predicting outputs that are very close to the expected outputs. For example:

x₁	x₂	y	a_o1
0	0	0	0.04
0	1	1	0.98
1	0	1	0.98
1	1	1	1.00

If we find the total loss J has not reduced at all or sufficiently after training, further optimisation may be achieved by experimenting with different learning rates $\alpha$ (e.g. 0.1, 0.01, 0.001). That's another topic in its own right...

Summary

Congratulations if you made it to the end and thanks for reading!

Here's what we covered, some in more detail than others:

Forward propagation
Sigmoid activation function
Concept of loss L and total loss J
Backpropagation using partial derivatives and the chain rule
Training to minimise the total loss J

Please comment below if you found this post useful or if you've spotted an error.

DEV Community

Neural networks 101: The basics of forward and backward propagation

Forward propagation

Backpropagation

Training

Summary

Top comments (0)

Read next

Elastic Load Balancing (ELB): Ensuring High Availability and Reliability

Day 18: Deploying Docker to the Cloud

How to Optimize Your React Web App: 7 Essential Steps

anyone wanna join me building my first app?(I'm a beginner bare with me)