Prayson Wilfred Daniel

Data Transformation is All You Need

In the world of machine learning and data science, sophisticated algorithms and models often grab the limelight. However, sometimes, the crux of a successful predictive model isn't just about choosing the right algorithm, but more about presenting the data in a manner that the algorithm can best learn from.

This short article, born from my procrastination, explores a simple linear regression model in light of the power of data visualization and transformation. My goal is to show how a simple change in data representation can improve model performance.

Setting the Stage: Basic Linear Regression

To illustrate this point, consider a dataset that captures the relationship between TV advertisements and sales. We start our exploration by loading our data and visualizing the relationship.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


# jupyter labextension install plotlywidget
pd.options.plotting.backend = "plotly"

# shared figure sizing reused by the plots below (values assumed; not in the original notebook)
fig_size = dict(width=800, height=400)

ADS_DATA_RI = "https://raw.githubusercontent.com/justmarkham/scikit-learn-videos/master/data/Advertising.csv"

dataf = pd.read_csv(ADS_DATA_RI, index_col=0)

# split my data to test the performance of the model
X, y = dataf[["TV"]], dataf["Sales"]
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.8, random_state=7)


(dataf
     .assign(split=lambda d: np.where(d.index.isin(X_train.index), "train", "test"))
     .plot(kind="scatter",
           x="TV",
           y="Sales",
           color="split",
           trendline="ols",
           trendline_scope="overall", 
           trendline_color_override="darkblue",
           opacity=.7,
           title="Sales ~ TV Ads")
)

[Figure: Sales ~ TV Ads scatter plot with an overall OLS trendline]

The scatter plot shows a relationship between TV advertisements and sales that is not entirely linear, and our linear regression trendline does not accurately capture the sales trend for lower-value TV ad spending.

Let's reverse engineer the linear model equation using scikit-learn:


from numpy.typing import NDArray
from sklearn.linear_model import LinearRegression
from sklearn import metrics

lm = LinearRegression(fit_intercept=True)
lm.fit(X_train, y_train)

def get_equation(features: NDArray[np.str_], coef: NDArray[np.float64], intercept: float, y: str = "Sales") -> str:
    """Reconstruct the fitted linear equation as a readable string."""
    terms = [f"{c:.5f} * {feature}" for feature, c in zip(features, coef)]
    equation = " + ".join([*terms, f"{intercept:.5f}"])

    return f"{y} ~ {equation}"


y_func = get_equation(features=lm.feature_names_in_, coef=lm.coef_, intercept=lm.intercept_)
print(y_func)
# 'Sales ~ 0.04727 * TV + 7.17408'

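Reading the equation: every extra unit of TV ad spend adds roughly 0.047 to Sales, on top of an intercept of about 7.17. For instance, a hypothetical TV budget of 100 gives about 0.04727 * 100 + 7.17408 ≈ 11.90, which we can confirm with the fitted model:

lm.predict(pd.DataFrame({"TV": [100]}))  # ≈ 11.90, matching the equation above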

And the R-squared score, which gives an idea about how well our model explains the variability of the response data around its mean, is:

y_pred = lm.predict(X_test)
R2 = f"{metrics.r2_score(y_true=y_test, y_pred=y_pred):.7f}"
print(R2)
# '0.6779489'
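Under the hood, R-squared is simply 1 - SS_res / SS_tot. As a quick sanity check (added here, not part of the original post), we can reproduce sklearn's score by hand:

ss_res = np.sum((y_test - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_test - y_test.mean()) ** 2)   # total sum of squares around the mean
print(1 - ss_res / ss_tot)                       # should match metrics.r2_score above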

However, on plotting the residuals, which are the differences between actual and predicted values, we observe a pattern:

...
(pd
     .DataFrame(y_test - y_pred)
     .rename(columns={"Sales": "Residuals"})
     .assign(TV=X_test["TV"],
             Position=X_test.index)
     .plot(kind="scatter",
           y="Residuals",
           x="Position",
           title=f"True Sales - Predicted Sales: {R2=} ",
           **fig_size
          )
     .add_hline(y=0)
     .update_traces(mode="markers", hovertemplate=None)
     .update_layout(hovermode="y unified")
)

[Figure: residuals of the test predictions plotted by observation position]

This pattern in the residuals suggests our linear model isn't a good fit: if the model captured the underlying structure of the data accurately, the residuals should show no pattern at all.
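One rough way to quantify that pattern (an added diagnostic, not in the original post) is to bucket the test residuals by TV spend and inspect the mean residual per bucket; a well-specified model should hover around zero in every bucket:

residuals = y_test - y_pred  # same residuals as plotted above
(pd.DataFrame({"TV": X_test["TV"], "residual": residuals})
   .assign(tv_bucket=lambda d: pd.cut(d["TV"], bins=5))
   .groupby("tv_bucket", observed=True)["residual"]
   .mean())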

Bend It Like Beckham

The visualization suggests that the relationship between TV ads and sales isn't strictly linear but appears inverse-exponential at first: the first circa $1-$50 spent on advertising might yield slower returns in sales before picking up momentum.

A simple way to capture this potential relationship is by transforming both the input (TV ads) and the output (Sales) using a logarithm.
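Before fitting anything, a quick way to gauge whether the transform helps is to compare the Pearson correlation (which measures linear association) on the raw versus log-transformed columns. This is a small check I'm adding, not part of the original post:

raw_corr = dataf["TV"].corr(dataf["Sales"])
log_corr = np.log10(dataf["TV"]).corr(np.log10(dataf["Sales"]))
print(f"Pearson r, raw:   {raw_corr:.4f}")
print(f"Pearson r, log10: {log_corr:.4f}")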

# Data Transformation

...
(dataf
     .assign(split=lambda d: np.where(d.index.isin(X_train.index), "train", "test"))
     .plot(kind="scatter",
           x="TV",
           y="Sales",
           color="split",
           trendline="ols",
           trendline_options=dict(log_x=True, log_y=True), # power of transformation
           trendline_scope="overall", 
           trendline_color_override="darkblue",
           opacity=.7,
           title="log(Sales) ~ log(TV) Ads + c",
           **fig_size)
)


[Figure: Sales ~ TV Ads scatter plot with a log-log OLS trendline]

Our transformed scatter plot indicates a much more linear relationship between log-transformed TV ads and log-transformed sales, which supports our initial hypothesis of a logarithmic relationship.

Transformed Linear Regression: A Better Fit?

Let's utilize the log-transformed data and fit a linear regression model:

from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import make_pipeline


def inverse_log10(X: float) -> float:
    """Undo a log10 transform."""
    return 10 ** X

# log10-transform the input (TV) on the way in ...
transformer = FunctionTransformer(func=np.log10, inverse_func=inverse_log10)
# ... and the target (Sales), back-transforming predictions to the original scale
linear_model = TransformedTargetRegressor(regressor=LinearRegression(), func=np.log10, inverse_func=inverse_log10)

lmx = make_pipeline(transformer, linear_model)
lmx.fit(X_train, y_train)

[Figure: fitted pipeline of FunctionTransformer and TransformedTargetRegressor(LinearRegression)]
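As a quick usage check (the TV values below are hypothetical, chosen only for illustration), the pipeline log-transforms the input and TransformedTargetRegressor back-transforms its output with inverse_log10, so predictions come back on the original Sales scale:

new_ads = pd.DataFrame({"TV": [50, 150, 250]})  # hypothetical ad budgets
lmx.predict(new_ads)                            # predicted Sales, original units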

The R-squared score for this new model is:

y_pred = lmx.predict(X_test)
R2 = f"{metrics.r2_score(y_true=y_test, y_pred=y_pred):.7f}"
print(R2)
# 0.6962380 

This score is higher than that of our initial model (0.6779), indicating that the transformed model explains more of the variability in the sales data.
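R-squared is not the only lens. If we also want an error expressed in the original sales units, we can compare the root mean squared error of both models; a small optional check, not in the original post:

# RMSE in original Sales units: lower is better
rmse_linear = metrics.mean_squared_error(y_test, lm.predict(X_test), squared=False)
rmse_log = metrics.mean_squared_error(y_test, lmx.predict(X_test), squared=False)
print(f"{rmse_linear=:.4f} {rmse_log=:.4f}")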

Visualizing our transformed model alongside the actual data, we get:

(dataf
     .assign(split=lambda d: np.where(d.index.isin(X_train.index), "train", "test"))
     .plot(kind="scatter",
           x="TV",
           y="Sales",
           color="split",
           opacity=.7,
           title="log10(Sales) ~ log10(TV) + c",
           **fig_size)
).add_traces(
    (dataf
         .assign(pred=lmx.predict(dataf[["TV"]]))
         .sort_values(by=["TV"])
         .plot(kind="line",
               x="TV",
               y="pred",
               **fig_size)
         .update_traces(line_color="darkblue")
    ).data)

[Figure: scatter plot of Sales vs TV with the back-transformed model's prediction line]

Next Step: Bayesian Linear Regression Model

We've noticed that as TV ad spend increases, the values deviate more from our fitted line. Bayesian linear regression is a better approach here: instead of a single fitted line, we get a distribution of lines, which lets us capture how the uncertainty varies across TV ad spend.

Say hello to PyMC. We can build our linear model as follows:


import arviz as az
import pymc as pm


# work on the log10 scale, mirroring the transformation above
log_Sales, log_TV = np.log10(y_train.to_numpy()), np.log10(X_train["TV"].to_numpy())

with pm.Model() as model:
    log_Sales_ = pm.MutableData("log_Sales", log_Sales)
    log_TV_ = pm.MutableData("log_TV", log_TV)

    # weakly informative priors for the intercept, slope and noise scale
    β0 = pm.Normal("β0", mu=0, sigma=20)
    β1 = pm.Normal("β1", mu=0, sigma=20)
    sigma = pm.HalfNormal("σ", sigma=10)

    # linear model on the log-log scale
    mu = β0 + β1 * log_TV_
    pm.Normal("log(Sales)", mu=mu, sigma=sigma, observed=log_Sales_, shape=mu.shape)

# sudo apt install graphviz
pm.model_to_graphviz(model)

[Figure: graphviz rendering of the PyMC model]

With Bayesian modelling, we draw samples from the posterior to capture the uncertainty in our model's parameters, given the observed data and our priors.

with model:
    idata = pm.sample(3000, chains=4, target_accept=0.85, random_seed=42)

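Before predicting, it is worth a quick look at the posterior itself; ArviZ (imported above) makes this a one-liner. This is an optional check I'm adding here, not part of the original post:

# posterior means, standard deviations, HDIs and convergence diagnostics (r_hat)
az.summary(idata, var_names=["β0", "β1", "σ"])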

# to predict and evaluate our model on the held-out test set
with model:
    pm.set_data({"log_TV": np.log10(X_test["TV"].to_numpy())})

    pm.sample_posterior_predictive(
        idata,
        predictions=True,
        extend_inferencedata=True,
        random_seed=42,
    )

# average the posterior predictive draws and back-transform to the original Sales scale
y_pred = 10 ** idata.predictions["log(Sales)"].mean(dim=["chain", "draw"]).data

R2, RMSE = metrics.r2_score(y_test, y_pred), metrics.mean_squared_error(y_test, y_pred, squared=False)

print(R2, RMSE)
# 0.6949903971655622, 2.8620992998783623
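Because we now have a full posterior predictive distribution rather than a single point estimate, we can also attach an uncertainty band to each prediction. A minimal sketch using plain NumPy quantiles (the 94% band and the variable names are my own choices, not from the original code):

post = idata.predictions["log(Sales)"].stack(sample=("chain", "draw"))
sales_draws = 10 ** post.to_numpy()                              # shape: (n_test_points, n_samples)
lower, upper = np.quantile(sales_draws, [0.03, 0.97], axis=-1)   # per-observation 94% band
print(lower[:5], upper[:5])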

Plotting the fitted lines


from collections import defaultdict

container = defaultdict(list)

# thin the posterior: keep every 25th draw from each chain
β0 = idata.posterior["β0"][:, ::25].to_numpy().flatten()
β1 = idata.posterior["β1"][:, ::25].to_numpy().flatten()

# one candidate regression line per retained posterior draw, back on the original scale
for i, (b0, b1) in enumerate(zip(β0, β1)):
    container[f"Sales_{i}"].extend(10 ** (b0 + b1 * np.log10(X_train["TV"])))

# the posterior-mean line, plus the observed training data
container["Sales_mean"].extend(10 ** (β0.mean().item() + β1.mean().item() * np.log10(X_train["TV"])))
container["Sales"].extend(y_train)
container["TV"].extend(X_train["TV"])


p = pd.DataFrame(container)

fig = (p
     .plot(kind="scatter",
           x="TV",
           y="Sales",
           opacity=.7,
           title=f"{R2=:.4f} {RMSE=:.4f} | log10(Sales) ~ log10(TV) + c",
           **fig_size)
)

# adding all candidate posterior lines
for column in p.columns[:-3]:
    fig = fig.add_traces(
        (p
             .sort_values(by=["TV"])
             .plot(kind="line",
                   x="TV",
                   y=column,
                   **fig_size)
             .update_traces(line_color="lightblue", opacity=0.07)
        ).data)

# add the posterior-mean line
fig.add_traces(
    (p
         .sort_values(by=["TV"])
         .plot(kind="line",
               x="TV",
               y="Sales_mean",
               **fig_size)
         .update_traces(line_color="darkblue")
    ).data)

[Figure: posterior lines (light blue) and the posterior-mean line (dark blue) over the training data]

What makes Bayesian modeling intriguing is that our initial model is just a beginning. We have complete freedom to create based on our imagination. Next, we could segment our data into TV ad buckets or bins and develop a separate linear model for each one. That could be a future article in itself 🤭.


Takeaway: 🍔

The main takeaway here is that the success of a predictive model doesn't always lie in choosing a more complex algorithm. Before you reach for a more intricate model, take a moment to step back and consider if a data transformation could simplify things. Remember, sometimes, data transformation is all you need.

Until then, keep on coding data science simplified.
