
# Explanation for the notation of linear models

In *The Elements of Statistical Learning*, the chapter on Supervised Learning describes the equation for a linear model in vector form (Eq 2.1) and in matrix form (Eq 2.2). The two forms are equivalent.

A vector is another word for a one-dimensional array.

We can write one horizontally with parentheses, like

$A = (1, 5, 7, 3, 2)$

but we can also write it vertically, in a more matrix-like form with square brackets:

$A = \begin{bmatrix} 1 \\ 5 \\ 7 \\ 3 \\ 2 \end{bmatrix}$
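In code, both notations describe the same thing. A minimal sketch in Python, using the values from the example above:

```python
# A vector is just a one-dimensional array. Whether we write it
# horizontally or vertically on paper, it holds the same components.
A = [1, 5, 7, 3, 2]

print(len(A))  # the vector has 5 components
print(A[0])    # the first component is 1
```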

Now we reach Equation 2.1 below.

$\hat{Y} = \hat{\beta_0} + \sum_{j=1}^p X_j \hat{\beta_j}$

This equation is a compact way to write out an equation you might see for linear regression, like this one:

$Y = -23 + 3x_1 + 3.6x_2$
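A hypothetical model like this is easy to evaluate directly. The intercept and coefficients below come from the example equation; the input values are made up:

```python
# Example model: Y = -23 + 3*x1 + 3.6*x2
# The intercept is -23; the coefficients are 3 and 3.6.
def predict(x1, x2):
    return -23 + 3 * x1 + 3.6 * x2

# Made-up inputs: x1 = 10, x2 = 5
print(predict(10, 5))  # -23 + 30 + 18 = 25.0
```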

Now going back to our Equation 2.1, what do these variables mean? Here's the equation again.

$\textcolor{orange}{\hat{Y}} = \hat{\beta_0} + \sum_{j=1}^p X_j \hat{\beta_j}$

First, $\hat{Y}$ is the dependent variable value, or output vector: it contains the values for what you want to predict. The caret symbol ^ here is called a "hat", and it marks an estimated value produced by our model rather than an observed one. So we read $\hat{Y}$ as "Y hat".

$\hat{Y} = \textcolor{orange}{\hat{\beta_0}} + \sum_{j=1}^p X_j \hat{\beta_j}$

The next variable we run into is $\beta_0$ . This is "the intercept, also known as the bias in machine learning."

On an X-Y plot like you have seen in grade school, this is where the line crosses the vertical y-axis. Another way to think about the intercept is as the baseline value of your dependent variable when your independent variable predictors are all zero.

$\hat{Y} = \hat{\beta_0} + \textcolor{orange}{\sum_{j=1}^p} X_j \hat{\beta_j}$

Now, let's shift gears and talk about some notation. There is a large "E"-looking symbol with some numbers and letters, $\sum_{j=1}^p$ . The "E" symbol is the Greek capital letter "sigma", and it is a fancy way to say "add these things up" — here, for $j$ running from $1$ to $p$.
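The summation sign is just a loop. A minimal sketch, reusing the intercept and coefficients from the earlier example equation with made-up inputs:

```python
# Sigma notation as a loop: y_hat = beta_0 + sum of x_j * beta_j for j = 1..p
beta_0 = -23          # intercept, from the example equation
beta = [3, 3.6]       # beta_1 .. beta_p
x = [10, 5]           # X_1 .. X_p, made-up inputs for one observation

y_hat = beta_0
for x_j, b_j in zip(x, beta):
    y_hat += x_j * b_j

print(y_hat)  # -23 + 3*10 + 3.6*5 = 25.0
```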

What things are we adding up? And how? Each term $X_j \hat{\beta_j}$ multiplies one predictor value by its coefficient. When we stack many observations together, each row of the matrix below holds the predictor values for one observation, and each $\hat{\beta_j}$ is a single entry in the column vector:

$X \hat{\beta} = \begin{pmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,n} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m,1} & x_{m,2} & \cdots & x_{m,n} \end{pmatrix} \begin{pmatrix} \beta_{1} \\ \beta_{2} \\ \vdots \\ \beta_{n} \end{pmatrix}$

So the coefficients $\hat{\beta_j}$ collected together form a column vector with one entry per predictor (the intercept $\hat{\beta_0}$ can be folded in too, by adding a column of 1s to $X$, which is how the book writes the matrix form in Eq 2.2):

$\hat{\beta} = \begin{pmatrix} \beta_{1} \\ \beta_{2} \\ \vdots \\ \beta_{n} \end{pmatrix}$

Multiplying everything together produces a single column vector, $\hat{Y}$ , with one prediction per row of $X$.
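The matrix-vector product can be sketched in a few lines of plain Python. The coefficients come from the example equation; the rows of `X` are made-up observations:

```python
# Each row of X is one observation; multiplying X by the coefficient
# vector (plus the intercept) yields one prediction per row: Y_hat.
X = [
    [10, 5],   # made-up observation 1
    [2, 0],    # made-up observation 2
]
beta = [3, 3.6]   # coefficients from the example equation
beta_0 = -23      # intercept

Y_hat = [beta_0 + sum(x_j * b_j for x_j, b_j in zip(row, beta)) for row in X]
print(Y_hat)  # [25.0, -17.0]
```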

We do all this so that we can run an equation like

$Y = -23 + 3x_1 + 3.6x_2$

over and over again across multiple sets of values, which are encoded as the rows of the matrix $X$ above.