DEV Community


Notes on Dr. Andrew Ng's "Machine Learning" Week 5

Fulton Byrne
I have an interest in high throughput data processing and a love for the programming languages that help!
・4 min read

This week is definitely one of the more difficult ones to get through. The exercise has a large number of complex, moving parts that all have to work together before anything completes.

Oftentimes, when struggling with a concept, I look for other explanations. I found Zeyuan Hu's write-up of the math for the exercise a great aggregation of all the math for the week: Andrew Ng's ML Week 04 - 05.

I also discovered some gems in the course forums. The forums are generally a horror story of duplicated questions, but if you look hard enough you can find some serious help from the course mentors.

Programming Exercise Tutorial (List) will be my first stop before starting any exercise from here on out. The mentor-provided tutorials typically offer a nigh step-by-step walkthrough of each exercise.

Mistakes Were Made

The common errors I ran into were:

  1. Incorrect Dimensions
  2. Matrix Multiplication Caveats

Incorrect Dimensions

The ex4 tutorial for nnCostFunction and backpropagation goes in depth on how each matrix's dimensions should match.

I also found these debugging tips to be extraordinarily helpful combined with knowing how to use the Octave Debugger.
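As a quick illustration of what that looks like in practice (the variable names below are my own stand-ins, not the assignment's), sprinkling size() checks through a function and dropping into the Octave debugger with keyboard is usually enough to catch a dimension mismatch:

```octave
% Hypothetical snippet inside a cost function; Theta1, Theta2, and X
% stand in for the exercise's variables.
disp(size(Theta1));   % e.g. 25 x 401: hidden units x (inputs + bias)
disp(size(Theta2));   % e.g. 10 x 26:  classes x (hidden units + bias)
disp(size(X));        % e.g. 5000 x 400: instances x features

% Pause here and inspect variables interactively; type "return" to
% resume execution or "dbquit" to abort.
keyboard;
```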

Matrix Multiplication

I really prefer using the vectorized approach for any solution in the course. Vectorized solutions are generally far easier to write, but you can also get really mixed up, because matrix multiplication has a lot of rules and conditions.

The cost function, with its double sigmas, made me really want to use matrix multiplication to do the summation, but I ended up using the easier element-wise multiplication and summation.

[Figure: the neural network cost function, a double summation over training examples i and output classes k]
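A sketch of the element-wise version I ended up with, assuming H holds the network's predictions h(X) and Y the recoded labels, both m x K (these names are mine, not the assignment's):

```octave
% Element-wise (Hadamard) form of the unregularized cost: multiply the
% matching terms entry by entry, then collapse both dimensions.
m = size(Y, 1);
J = (1 / m) * sum(sum(-Y .* log(H) - (1 - Y) .* log(1 - H)));
```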

However, the forum post Computing the NN cost J using the matrix product explains the intuition needed in order to correctly apply matrix multiplication in this case.

I'm writing mine out in Octave as a way to translate the material between two mediums. So we begin with two m x K (m = 3 & K = 2) matrices:

A = [1 2; 3 4; 5 6]
B = [7 8; 9 10; 11 12]

We want the sum over the rows and columns of the element-wise product:

sum(sum(A .* B))
ans =  217 

Naively, I expected I could reach an equivalent answer with the matrix product of the transpose of A and B.

A' * B

ans =

    89    98
   116   128

Surprise! It's a K x K (2 x 2) matrix, because when you multiply K x m by m x K you get a K x K matrix.

The cost function is supposed to reduce to a scalar value, though. What's missing? As the mentor's article points out, we just need to take the trace (the sum of the main diagonal of a square matrix) for our answer:

trace(A' * B)
ans = 217

Does this always work? In this instance, where we want the sum over the rows and columns of the element-wise product of two matrices of equal dimensions: yes. That is not always the case, though.

Matrix multiplication requires m x n * n x p dimensions: only the columns of the first must match the rows of the second, and the result is m x p. You won't necessarily receive a square matrix.
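A quick illustration of a perfectly legal but non-square product (the matrices here are made up for the example):

```octave
C = ones(2, 3);   % 2 x 3
D = ones(3, 4);   % 3 x 4
E = C * D;        % legal: the inner dimensions (3) match
size(E)           % 2 x 4 -- a non-square product, so summing its main
                  % diagonal no longer captures every element
```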

The Hadamard product, a.k.a. element-wise multiplication, only requires that the two matrices have the same dimensions. Again, the result is not necessarily a square matrix.

For the above cost function specifically, though, this does always work, and not because of X itself. Intuitively: if X is a matrix of (instances) x (features), so that each row is an instance with a column for each feature, then each row of y holds that instance's "actual" values. X alone doesn't line up with y, because y is typically either a column vector or has one column per class, as in the case of digit recognition.

The reason this always works for the cost function is that we are not dealing with X but with h(X), which contains the "predicted values" for each instance in X and therefore has the same dimensions as y. The formula can multiply the transpose of h(X) by y because their inner dimensions match, and we can sum the result using trace because their product is a square K x K matrix.
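Putting it together with made-up predictions and labels (H and Y below are my own stand-ins for h(X) and the recoded y), the two reductions agree:

```octave
% Fake m = 4 instances, K = 3 classes; values chosen arbitrarily.
H = [0.9 0.1 0.2; 0.2 0.8 0.1; 0.1 0.2 0.7; 0.6 0.3 0.3];
Y = [1 0 0; 0 1 0; 0 0 1; 1 0 0];

elementwise = sum(sum(H .* Y));   % Hadamard product, then full sum
viatrace    = trace(H' * Y);      % K x K product, then main diagonal
% Both pick out 0.9 + 0.8 + 0.7 + 0.6 = 3.0
```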

I won't make any guarantees here, though, as this really only applies to the double-sigma part of the equation. Evaluating the terms inside the summation is another story, but I'd wager that if I encountered this equation again in another exercise, I could build the vectorized version now that I understand more of the rules.

It is also interesting to note the mentor's point that the vectorized implementation may not necessarily perform better, since trace throws away all the off-diagonal calculations. So if you built the element-wise version like I did, give yourself a pat on the back.

Those are my notes for the week! Best of luck!
