Eric Leung

Explanation for the notation of linear models

In the Elements of Statistical Learning book, the chapter on Supervised Learning describes the equation for a linear model in a vector form (Eq. 2.1) and a matrix form (Eq. 2.2). Both are equivalent.

A vector is another word for a one-dimensional array.

We can write a vector horizontally with parentheses, like

A = (1, 5, 7, 3, 2)

but we can also write it vertically, in a more matrix-like form, with square brackets:

A = \begin{bmatrix} 1 \\ 5 \\ 7 \\ 3 \\ 2 \end{bmatrix}
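As a quick illustration (a minimal sketch assuming NumPy), the same values can be stored as a one-dimensional array or reshaped into a column:

```python
import numpy as np

# The vector A written "horizontally": a one-dimensional array
A = np.array([1, 5, 7, 3, 2])

# The same values written "vertically" as a 5x1 column vector
A_column = A.reshape(-1, 1)

print(A)         # [1 5 7 3 2]
print(A_column)  # the same five values, stacked as rows of a single column
```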

Now we reach Equation 2.1 below.

\hat{Y} = \hat{\beta_0} + \sum_{j=1}^p X_j \hat{\beta_j}

This equation is a fancy way of writing out an equation you might see for linear regression, like this:

Y = -23 + 3x_1 + 3.6x_2
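To make this concrete, here is a minimal sketch that plugs numbers into this equation (the predictor values x_1 = 2 and x_2 = 5 are made up for illustration):

```python
# Coefficients from the example regression equation above
intercept = -23
b1, b2 = 3, 3.6

# Made-up predictor values for a single observation
x1, x2 = 2, 5

# Y = -23 + 3*x1 + 3.6*x2
Y = intercept + b1 * x1 + b2 * x2
print(Y)  # -23 + 6 + 18 = 1.0
```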

Now going back to our Equation 2.1, what do these variables mean? Here's the equation again.

\textcolor{orange}{\hat{Y}} = \hat{\beta_0} + \sum_{j=1}^p X_j \hat{\beta_j}

First, \hat{Y} is the dependent variable value, or output vector. This will contain the values for what you want to predict. The caret symbol ^ here is called a "hat". This is a hypothetical value we find from our model's prediction. So we can read \hat{Y} as "Y hat".

\hat{Y} = \textcolor{orange}{\hat{\beta_0}} + \sum_{j=1}^p X_j \hat{\beta_j}

The next variable we run into is \beta_0. This is "the intercept, also known as the bias in machine learning."

On an X-Y plot like the ones you've seen in grade school, this is where the line crosses the vertical y-axis. Another way to think about this intercept is as the baseline value of your dependent variable when your independent variable predictors are all zero.
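For example, plugging zeros into the regression equation from earlier leaves only the intercept:

Y = -23 + 3(0) + 3.6(0) = -23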

\hat{Y} = \hat{\beta_0} + \textcolor{orange}{\sum_{j=1}^p} X_j \hat{\beta_j}

Now, let's switch gears and talk about some notation. There is a large "E"-looking symbol with some numbers and letters attached, \sum_{j=1}^p . We call this "E" symbol "sigma" (the Greek capital letter), and it is a fancy way to say "add these things up".
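In code, the sigma is just a loop that accumulates a running total. Here is a minimal sketch, reusing the values from the vector A above:

```python
# sum_{j=1}^{p} a_j written out as a loop
a = [1, 5, 7, 3, 2]   # a_1 ... a_p
p = len(a)

total = 0
for j in range(p):    # j runs 1..p in the math, 0..p-1 in Python
    total += a[j]

print(total)  # 1 + 5 + 7 + 3 + 2 = 18
```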

What things are we adding up? And how? The X_j \hat{\beta_j} term actually comes from matrices: each X_j is an entry in a row of the matrix below, and each \hat{\beta_j} is a single value in the column vector.

X \hat{\beta} = \begin{pmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,n} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m,1} & x_{m,2} & \cdots & x_{m,n} \end{pmatrix} \begin{pmatrix} \beta_{1} \\ \beta_{2} \\ \vdots \\ \beta_{n} \end{pmatrix}
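As a rough sketch with NumPy (the numbers and the 3-by-2 shape are made up), the matrix-vector product pairs each row of X with the column of coefficients:

```python
import numpy as np

# Made-up design matrix: m = 3 observations (rows), n = 2 predictors (columns)
X = np.array([[2, 5],
              [1, 0],
              [4, 2]])

# Made-up estimated coefficients, one per predictor
beta_hat = np.array([3, 3.6])

# Each entry of the result is one row of X dotted with beta_hat
print(X @ beta_hat)  # [24.   3.  19.2]
```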

Similarly, the intercept \hat{\beta_0} can be thought of as a column vector with the same value repeated once per observation:

\begin{pmatrix} \hat{\beta_0} \\ \hat{\beta_0} \\ \vdots \\ \hat{\beta_0} \end{pmatrix}

Multiplying the matrix by the coefficient vector and adding the intercept vector produces a single column vector, \hat{Y} .
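Putting the pieces together (same made-up numbers as in the sketch above), the prediction vector is the matrix-vector product plus the broadcast intercept:

```python
import numpy as np

# Same made-up numbers as above
X = np.array([[2, 5], [1, 0], [4, 2]])
beta_hat = np.array([3, 3.6])
beta_0 = -23

# The intercept is added to every entry of X @ beta_hat
Y_hat = beta_0 + X @ beta_hat
print(Y_hat)  # [  1.  -20.   -3.8]
```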

We do all of this so that we can run equations like

Y = -23 + 3x_1 + 3.6x_2

over and over again across multiple sets of values, which are encoded as the rows of the matrix X above.
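To check that the matrix form really is the same regression equation applied once per row, here is a small sketch comparing the two (same made-up numbers as before):

```python
import numpy as np

X = np.array([[2, 5], [1, 0], [4, 2]])
beta_hat = np.array([3, 3.6])
beta_0 = -23

# Apply Y = -23 + 3*x1 + 3.6*x2 one row at a time...
row_by_row = [beta_0 + 3 * x1 + 3.6 * x2 for x1, x2 in X]

# ...and all rows at once with the matrix form
all_at_once = beta_0 + X @ beta_hat

print(np.allclose(row_by_row, all_at_once))  # True
```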
