This week was definitely one of the more difficult ones to get through. The exercise has a large number of complex, moving parts that all have to fit together before anything works.

Oftentimes when I'm struggling with a concept, I look for other explanations. I found Zeyuan Hu's write-up, *Andrew Ngs ML Week 04 - 05*, a great aggregation of all the math for the week.

I also discovered some gems in the course forums. The forums are generally a horror show of duplicated questions, but if you look hard enough you can find serious help from the course mentors.

*Programming Exercise Tutorial (List)* will be my first stop before starting any exercise from here on out. The mentor-provided tutorials typically offer a nearly step-by-step walkthrough for completing each exercise.

## Mistakes Were Made

The common errors I ran into were:

- Incorrect Dimensions
- Matrix Multiplication Caveats

### Incorrect Dimensions

*ex4 tutorial for nnCostFunction and backpropagation* goes in depth on how each matrix dimension should match.

I also found these debugging tips to be extraordinarily helpful combined with knowing how to use the Octave Debugger.
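As a minimal sketch of that kind of dimension debugging, here is the forward-propagation shape check translated into Python/NumPy (the `Theta1`/`a1`-style names follow the exercise's conventions, but the code itself is my own illustration, not the exercise's): assert the shape after every step so a mismatch fails loudly at the line that caused it.

```python
import numpy as np

# Hypothetical shapes for a one-hidden-layer network:
# 5000 examples, 400 input features, 25 hidden units, 10 classes.
X = np.ones((5000, 400))
Theta1 = np.ones((25, 401))   # +1 column for the bias term
Theta2 = np.ones((10, 26))

a1 = np.hstack([np.ones((X.shape[0], 1)), X])    # add bias column -> 5000x401
assert a1.shape == (5000, 401)

z2 = a1 @ Theta1.T                               # (5000x401)(401x25) -> 5000x25
assert z2.shape == (5000, 25)

a2 = np.hstack([np.ones((z2.shape[0], 1)), z2])  # add bias column -> 5000x26
z3 = a2 @ Theta2.T                               # (5000x26)(26x10) -> 5000x10
assert z3.shape == (5000, 10)
# Sigmoid activations are omitted here, since they don't change any shapes.
```

The same idea works in Octave with `size()` and `assert()` at each step.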

### Matrix Multiplication

I *really* prefer the vectorized approach for any solution in this course. Vectorized solutions are generally far easier to write, but they can also get you badly mixed up, because matrix multiplication comes with a lot of rules and conditions.

The cost function, with its double summation (the two sigmas), made me really want to use matrix multiplication to do the summing, but I ended up using the easier element-wise multiplication followed by summation.
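For reference, the element-wise version looks like this, sketched in Python/NumPy with tiny made-up numbers (`Y` is the one-hot label matrix and `H` the matrix of hypothesis outputs, both m×K; the values are invented for illustration):

```python
import numpy as np

# Tiny made-up example: m = 2 examples, K = 3 classes.
Y = np.array([[1, 0, 0],
              [0, 1, 0]])          # one-hot "actual" values
H = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.7, 0.1]])    # hypothesis outputs h(X)

m = Y.shape[0]
# The double summation, done as an element-wise product summed
# over all rows and columns.
J = -(1 / m) * np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H))
print(round(J, 4))  # -> 0.5595 (unregularized cost)
```

In Octave the same line is `J = -(1/m) * sum(sum(Y .* log(H) + (1 - Y) .* log(1 - H)))`.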

However, the forum post *Computing the NN cost J using the matrix product* explains the intuition needed in order to correctly apply matrix multiplication in this case.

I'm writing mine out in Octave as a way to translate the material between two mediums. So we begin with two m×K (m = 3, K = 2) matrices:

```
A = [1 2; 3 4; 5 6]
B = [7 8; 9 10; 11 12]
```

We want the sum over the rows and columns of the element-wise product:

```
sum(sum(A .* B))
ans = 217
```

Naively, I expect I can reach an equivalent answer by the matrix product of transpose A and B.

```
A' * B
ans =
89 98
116 128
```

Surprise! It's a K×K (2×2) matrix, because when you multiply K×m by m×K you get a K×K matrix.

The cost function is supposed to reduce to a scalar value, though. What's missing? As the mentor's article points out, we just need the trace, i.e. the sum of the main diagonal of a *square* matrix, for our answer:

```
trace(A' * B)
ans = 217
```

Does this always work? In this instance, where we want the sum over the rows and columns of the element-wise product of two matrices of equal dimensions: yes. That is not always the case, though.

Matrix multiplication requires m×n times n×p dimensions, so only the columns of the first matrix must match the rows of the second. You won't necessarily receive a square matrix.
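To see that concretely, here is a quick NumPy sketch with made-up matrices: the row counts agree, so the product of the transpose with the other matrix is legal, but the result is rectangular and the trace trick no longer gives us the full sum.

```python
import numpy as np

A = np.arange(6).reshape(3, 2)    # 3x2
B = np.arange(12).reshape(3, 4)   # 3x4 -- same rows, different columns

P = A.T @ B                       # (2x3)(3x4) -> 2x4, not square
print(P.shape)                    # (2, 4)
# The sum-via-trace identity needs a square product; with a
# rectangular P, summing the main diagonal misses most entries.
```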

The Hadamard product, a.k.a. element-wise multiplication, only requires that the two matrices have the same dimensions; again, those are not necessarily square.

Specific to the cost function above, though, this *does* always work, because `y` and `h(X)` are both m×K. Intuitively this makes sense. If `X` is a matrix of (instances) × (features), so that each row is an instance with a column for each feature, then each row of `y` holds that instance's "actual" value. That alone doesn't line the dimensions up, because `y` is typically either a column vector, or its columns correspond to the number of classes, as in digit recognition.

The reason this *does* always work for the cost function is that we are not dealing with `X` but with `h(X)`, which contains the "predicted" values for each instance in `X` and therefore has the same dimensions as `y`. So the formula can multiply `h(X)` and `y`, because they have the same dimensions, and we can sum the result using `trace`, because their product is a square matrix.
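Putting the pieces together, the trace form of the double summation matches the element-wise form exactly. A quick NumPy check, using the same tiny made-up `Y` and `H` values as before (invented for illustration):

```python
import numpy as np

Y = np.array([[1, 0, 0],
              [0, 1, 0]])          # one-hot labels, m x K
H = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.7, 0.1]])    # predictions h(X), same m x K
m = Y.shape[0]

# Element-wise version: multiply, then sum over rows and columns.
J_elem = -(1 / m) * np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H))

# Trace version: each Y' * log(H) product is K x K, and summing its
# main diagonal recovers exactly the matched instance/class products.
J_trace = -(1 / m) * (np.trace(Y.T @ np.log(H))
                      + np.trace((1 - Y).T @ np.log(1 - H)))

assert np.isclose(J_elem, J_trace)
```

The off-diagonal entries of those K×K products are computed and then thrown away, which is the waste the mentor warns about below.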

I won't make any guarantees here, though, as this really only applies to the double-summation part of the equation. Evaluating the terms inside it is another story, but I'd wager that if I encountered this equation again in another exercise, I could build the vectorized version now that I understand more of the rules.

It is also interesting to note the mentor's point that the vectorized implementation may not necessarily perform better, because of all the wasted calculations that are thrown away after taking the trace. So if you built the element-wise version like me, give yourself a pat on the back.

Those are my notes for the week! Best of luck!
