Eric Leung

Explanation for the notation of linear models

In the Elements of Statistical Learning book, the chapter on Supervised Learning describes the equation for a linear model in a vector form (Eq. 2.1) and a matrix form (Eq. 2.2). Both are equivalent.

A vector is another word for a one-dimensional array.

We can write a vector horizontally with parentheses, like

A = (1, 5, 7, 3, 2)

but we can also write it vertically, in a more matrix-like form, with square brackets:

A = \begin{bmatrix} 1 \\ 5 \\ 7 \\ 3 \\ 2 \end{bmatrix}
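As a quick illustration (a minimal sketch assuming NumPy), the same values can be stored as a one-dimensional array or reshaped into a column:

```python
import numpy as np

# The vector A written "horizontally": a one-dimensional array
A = np.array([1, 5, 7, 3, 2])

# The same values written "vertically" as a 5x1 column vector
A_column = A.reshape(-1, 1)

print(A)         # [1 5 7 3 2]
print(A_column)  # the same five values, stacked as rows of a single column
```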

Now we reach Equation 2.1 below.

\hat{Y} = \hat{\beta_0} + \sum_{j=1}^p X_j \hat{\beta_j}

This equation is a fancy way of writing out an equation you might see for linear regression, like this:

Y = -23 + 3x_1 + 3.6x_2
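To make this concrete, here is a minimal sketch that plugs numbers into this equation (the predictor values x_1 = 2 and x_2 = 5 are made up for illustration):

```python
# Coefficients from the example regression equation above
intercept = -23
b1, b2 = 3, 3.6

# Made-up predictor values for a single observation
x1, x2 = 2, 5

# Y = -23 + 3*x1 + 3.6*x2
Y = intercept + b1 * x1 + b2 * x2
print(Y)  # -23 + 6 + 18 = 1.0
```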

Now going back to our Equation 2.1, what do these variables mean? Here's the equation again.

\textcolor{orange}{\hat{Y}} = \hat{\beta_0} + \sum_{j=1}^p X_j \hat{\beta_j}

First, \hat{Y} is the dependent variable value, or output vector. This will contain the values for what you want to predict. The caret symbol ^ here is called a "hat". This is a hypothetical value we find from our model's prediction. So we can read \hat{Y} as "Y hat".

\hat{Y} = \textcolor{orange}{\hat{\beta_0}} + \sum_{j=1}^p X_j \hat{\beta_j}

The next variable we run into is \beta_0. This is "the intercept, also known as the bias in machine learning."

On an X-Y plot like the ones you've seen in grade school, this is where the line crosses the vertical y-axis. Another way to think about this intercept is as the baseline value of your dependent variable when your independent variable predictors are all zero.
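For example, plugging zeros into the regression equation from earlier leaves only the intercept:

Y = -23 + 3(0) + 3.6(0) = -23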

\hat{Y} = \hat{\beta_0} + \textcolor{orange}{\sum_{j=1}^p} X_j \hat{\beta_j}

Now, let's switch gears and talk about some notation. There is a large "E"-looking symbol with some numbers and letters attached, \sum_{j=1}^p . We call this "E" symbol "sigma" (the Greek capital letter), and it is a fancy way to say "add these things up".
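In code, the sigma is just a loop that accumulates a running total. Here is a minimal sketch, reusing the values from the vector A above:

```python
# sum_{j=1}^{p} a_j written out as a loop
a = [1, 5, 7, 3, 2]   # a_1 ... a_p
p = len(a)

total = 0
for j in range(p):    # j runs 1..p in the math, 0..p-1 in Python
    total += a[j]

print(total)  # 1 + 5 + 7 + 3 + 2 = 18
```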

What things are we adding up? And how? The X_j \hat{\beta_j} term actually comes from matrices: each X_j is an entry in a row of the matrix below, and each \hat{\beta_j} is a single value in the column vector.

X \hat{\beta} = \begin{pmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,n} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m,1} & x_{m,2} & \cdots & x_{m,n} \end{pmatrix} \begin{pmatrix} \beta_{1} \\ \beta_{2} \\ \vdots \\ \beta_{n} \end{pmatrix}
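As a rough sketch with NumPy (the numbers and the 3-by-2 shape are made up), the matrix-vector product pairs each row of X with the column of coefficients:

```python
import numpy as np

# Made-up design matrix: m = 3 observations (rows), n = 2 predictors (columns)
X = np.array([[2, 5],
              [1, 0],
              [4, 2]])

# Made-up estimated coefficients, one per predictor
beta_hat = np.array([3, 3.6])

# Each entry of the result is one row of X dotted with beta_hat
print(X @ beta_hat)  # [24.   3.  19.2]
```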

Similarly, the intercept \hat{\beta_0} can be thought of as a column vector with the same value repeated once per observation:

\begin{pmatrix} \hat{\beta_0} \\ \hat{\beta_0} \\ \vdots \\ \hat{\beta_0} \end{pmatrix}

Multiplying the matrix by the coefficient vector and adding the intercept vector produces a single column vector, \hat{Y} .
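Putting the pieces together (same made-up numbers as in the sketch above), the prediction vector is the matrix-vector product plus the broadcast intercept:

```python
import numpy as np

# Same made-up numbers as above
X = np.array([[2, 5], [1, 0], [4, 2]])
beta_hat = np.array([3, 3.6])
beta_0 = -23

# The intercept is added to every entry of X @ beta_hat
Y_hat = beta_0 + X @ beta_hat
print(Y_hat)  # [  1.  -20.   -3.8]
```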

We do all of this so that we can run equations like

Y = -23 + 3x_1 + 3.6x_2

over and over again across multiple sets of values, which are encoded as the rows of the matrix X above.
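To check that the matrix form really is the same regression equation applied once per row, here is a small sketch comparing the two (same made-up numbers as before):

```python
import numpy as np

X = np.array([[2, 5], [1, 0], [4, 2]])
beta_hat = np.array([3, 3.6])
beta_0 = -23

# Apply Y = -23 + 3*x1 + 3.6*x2 one row at a time...
row_by_row = [beta_0 + 3 * x1 + 3.6 * x2 for x1, x2 in X]

# ...and all rows at once with the matrix form
all_at_once = beta_0 + X @ beta_hat

print(np.allclose(row_by_row, all_at_once))  # True
```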
