Use of polynomials in linear regression analysis - extension to linear models

We are talking about linear regression.

And we are taking two concepts that are very common in linear regression, and combining them to offset the limitations of both, to obtain better models and better results.

The two concepts are interactions and polynomials.

Interaction:

This is how the concept of interaction was introduced to me (more or less):
An interaction occurs when an independent variable has a different effect on the outcome depending on the values of another independent variable.
Or in other words: when building a linear regression model, if the effect of a feature on the target is influenced by another feature, this means that there is an interaction between them.
This means that we can multiply the two variables and in this way obtain a term that expresses their ‘interaction’. By adding this term to our linear regression equation, our model accounts for the interaction between these two features (a small code sketch of this is shown at the end of this section).
It makes a lot of sense and it sounds pretty cool!
We can think of a few examples where this would apply: the yield of crops can depend on many factors that interact with each other, like humidity and temperature, the presence of certain nutrients in the soil, exposure to sunlight, etc.
Or, in determining the risk of diabetes for an individual, certain factors might interact, like age and hypertension or BMI.
More on interactions in linear regression here.
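
To make the idea concrete, here is a minimal sketch of adding a hand-built interaction term to an OLS model with statsmodels; the crop-yield column names and numbers are made up for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical crop-yield data: the column names are only for illustration
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "humidity": rng.uniform(30, 90, 200),
    "temperature": rng.uniform(10, 35, 200),
})
# Simulate a target whose response to temperature depends on humidity
df["yield"] = (2 * df["humidity"] + 3 * df["temperature"]
               + 0.05 * df["humidity"] * df["temperature"]
               + rng.normal(0, 5, 200))

# The interaction term is simply the product of the two features
df["humidity_x_temperature"] = df["humidity"] * df["temperature"]

X = sm.add_constant(df[["humidity", "temperature", "humidity_x_temperature"]])
model = sm.OLS(df["yield"], X).fit()
print(model.summary())  # the interaction coefficient should come out close to 0.05
```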

What was a little odd to me was that, given this possibility, we seem to be left guessing what type of interactions might exist. The idea would be to try out a few interactions by multiplying some terms together more or less at random, and by trial and error find the interactions that are real, by looking at our R squared and seeing which interactions improved it. And even if we could correctly infer that an interaction exists, we would still be left guessing the exact term that describes it: is the nature of the interaction multiplicative? To which power for each variable? What if it makes sense to include more variables in that interaction as well? And what coefficient should it have once it is included in the equation?

Polynomials

On to the second tool.
The other possibility we have is that, if we think the relationship between our dependent variable and the independent ones might not be linear, we can include polynomial terms in our equation.
The idea is that you can transform your input variable by, e.g., squaring it. The corresponding model would then be:

ŷ = β₀ + β₁x + β₂x²

The squared x at that point becomes a new variable to add to the equation.
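
As a tiny illustration (the data here is simulated just for this sketch), adding the squared term as an extra column and fitting an ordinary linear regression on it:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Simulated data with a quadratic relationship plus noise
rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, 100)
y = 1.5 * x**2 - 2 * x + rng.normal(0, 1, 100)

# The squared input simply becomes an extra column in the design matrix
X = np.column_stack([x, x**2])
reg = LinearRegression().fit(X, y)
print(reg.intercept_, reg.coef_)  # one coefficient for x, one for x^2
```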

Below are some graphs that show what higher-order polynomial fits look like.

[Figure: example fits with polynomials of increasing degree]

We can do this with higher orders using PolynomialFeatures (from sklearn.preprocessing import PolynomialFeatures), which can calculate for us all the terms from all the independent variables we have, multiplied together up to a polynomial degree that we get to set.
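For reference, here is a minimal sketch of what that looks like on made-up data:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two original features; PolynomialFeatures adds x0^2, x0*x1, x1^2, x0^3, ... up to the chosen degree
X = np.array([[2.0, 3.0],
              [1.0, 5.0]])

poly = PolynomialFeatures(degree=3, include_bias=False)
X_poly = poly.fit_transform(X)
print(poly.get_feature_names_out())  # ['x0' 'x1' 'x0^2' 'x0 x1' 'x1^2' 'x0^3' ...]
print(X_poly.shape)                  # (2, 9): 9 generated terms per row
```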
The problem with this approach is that once we use all the terms produced by the polynomial features, our model will tend to overfit. All those terms will probably describe our curve very well, most likely too well: picking up the real trends of the relationship between the variables, but also picking up some random noise.
Let us see this with a visual example:

[Figure: a high-degree polynomial fit that follows the sample points too closely]

This is an example of a model whose equation follows our sample so precisely that it ends up overfitting: the model has picked up on the noise as well as the signal in the data.
These models will perform extremely well on our train data (the data we used to build the model) but very poorly on any new data.
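
A quick way to see this numerically is to fit a very high-degree polynomial on simulated data and compare the R squared on the training set with the R squared on held-out data (a minimal sketch, not the data from this post):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Simulated data: a gentle linear trend plus noise
rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, 40).reshape(-1, 1)
y = 0.5 * x.ravel() + rng.normal(0, 1, 40)

X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)

# A very high degree lets the model chase the noise in the training points
model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
model.fit(X_train, y_train)

print("train R^2:", r2_score(y_train, model.predict(X_train)))  # high
print("test  R^2:", r2_score(y_test, model.predict(X_test)))    # typically much lower, often negative
```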

What we aim for our model to do is to pick up the real trend, without describing the noise in every detail.
These types of models might have slightly lower results on the train data, because they don't fit it perfectly, but they will also perform very well on new, unseen data, because they describe well the relationship between the dependent and independent variables.
This is what this type of model would look like:

[Figure: a lower-degree fit that captures the general trend without chasing the noise]

It is clear that our model 'has understood' the general relationship between the target and the input, and that once it is fed data that is different, but describes the same phenomenon, it should be able to identify that relationship again.

Usually...

The usual recommendation is that, when it is clear that a polynomial of a certain degree is overfitting, we should lower the degree of the polynomial.
But this might (and in my case it did) end up making the polynomial terms not very relevant, since reducing the degree even by one cuts out many terms (especially if we have many variables).
The quality of the fit drops dramatically, making the whole use of polynomial terms almost pointless.

Putting these two together

The interesting fact is that those polynomial terms are actually the same thing as interaction terms, just with higher degrees. And PolynomialFeatures does us the favor of calculating all of those terms for us, with two simple lines of code.
The idea I had at this point was not to randomly test a few interaction terms, but to use the terms from PolynomialFeatures and add those to the equation in my linear model.
What I did in practice was to select a high polynomial degree, one that would usually lead to overfitting, add those terms to my OLS model and run it normally.
At this point, thanks to statsmodels, we have a chance to see the summary and extract the coefficients.
It was simple to sort those values, select the top 5 (or as many as we want) and study those (see the sketch at the end of this section).
One thing that we can do is look at these terms to see if there is something interesting there, since these can reveal some sort of interaction or correlation that can explain the model better to us. The top terms are the ones that are most influential in determining the outcome in the dependent variable.
Looking at these terms therefore can tell us a whole lot about the problem that we are trying to solve and what our target variable depends on.
The other thing that we can do that can definitely improve our results is to include these top terms in our model, without including all the other ones produced by the PolynomialFeatures.
Why does this make a lot of sense?
Because the terms with heavier weights are probably the ones that describe the main part of the relationship, being the heaviest in the equation. Most likely these terms are the ones that capture the general trend and the overall shape of the curve that describes the relationship (especially if we have done preprocessing and scaling in an appropriate way).
The other terms that we are leaving out, the ones with lower weights, are most likely the ones picking up the noise instead! Since noise is usually randomly distributed, it would be described by terms with lower weights, terms that maybe describe the shape of the curve well, but only for small parts of it; that sounds a lot like noise, doesn't it?
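
Putting the whole workflow together, here is a minimal sketch of the approach described above on simulated data; the column names, the degree, and the choice of 5 terms are just illustrative, and as noted above, scaling the features beforehand makes the coefficient magnitudes comparable:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Simulated data standing in for the real dataset (column names are made up)
rng = np.random.default_rng(3)
X = pd.DataFrame(rng.normal(size=(300, 3)), columns=["x1", "x2", "x3"])
y = 2 * X["x1"] + X["x1"] * X["x2"] - 0.5 * X["x3"] ** 2 + rng.normal(0, 1, 300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 1. Generate all polynomial and interaction terms up to degree 3
poly = PolynomialFeatures(degree=3, include_bias=False)
train_poly = pd.DataFrame(poly.fit_transform(X_train),
                          columns=poly.get_feature_names_out(X.columns),
                          index=X_train.index)
test_poly = pd.DataFrame(poly.transform(X_test),
                         columns=poly.get_feature_names_out(X.columns),
                         index=X_test.index)

# 2. Fit OLS on all the terms and inspect the coefficients in the summary
full_model = sm.OLS(y_train, sm.add_constant(train_poly)).fit()

# 3. Keep only the terms with the largest absolute coefficients
#    (in practice, scale the features first so the magnitudes are comparable)
top_terms = full_model.params.drop("const").abs().sort_values(ascending=False).head(5).index

# 4. Refit using only those top terms and compare train/test R squared
reduced_model = sm.OLS(y_train, sm.add_constant(train_poly[top_terms])).fit()
print("train R^2:", reduced_model.rsquared)
print("test  R^2:", r2_score(y_test,
                             reduced_model.predict(sm.add_constant(test_poly[top_terms]))))
```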

I tried this myself in my project, and it worked great.
When I used the model with the third-degree polynomials, the train performed pretty well with an R squared of 0.72, but the R squared for the test was a disastrous -0.93.

When instead I used the third-degree polynomials but picked only the top 5 terms with the highest coefficients, the RMSE for both the train and the test reached 0.8!

Conclusion

This technique can be used extensively, by increasing the degree of the polynomials included and also increasing the number of terms that we choose to keep: in my case it was 5, but more can be included, provided we keep an eye on the R squared of the test to avoid overfitting.
We can learn a great deal from these terms about what is the nature of the relationship between the variables and their interaction, while improving our model’s results considerably without falling into the trap of overfitting.

I learned later that there are techniques that bring us to a similar result, selecting only some of the features or increasing and decreasing their weights to improve the results of the model (Ridge and Lasso regularization, SelectKBest, wrapper methods…), but I feel that these methods tend to lean too much toward the black-box model, moving from inferential statistics toward predictive modeling, where we won't necessarily be able to see what the coefficients are, what weights they have, or the way these methods picked them.
It would probably be more efficient, but I personally found a lot of value and satisfaction in trying this selection of features by hand, even just one time, and I wonder if this is the type of thing in which machines ultimately won't be able to fully replace us (I hope!), because they won't be able to grasp the information that we can: understanding why a certain relationship between two variables makes sense, or exploring further into what it entails.
Even high volumes of calculations performed at super high speed ultimately can't beat the judgement and understanding of a human being, who can see a piece of information, make connections, and apply critical thinking to what they have just learnt.
