TheCSPandz

# Introduction to Regression for House Price Prediction

## What is Regression?

Regression is a statistical method used in finance, investing, and other disciplines that attempts to determine the strength and character of the relationship between one dependent variable (usually denoted by Y) and a series of other variables (known as independent variables). ~investopedia.com

In this post, we will use the House Rent Prediction Dataset from Kaggle to see how regression works in practice. I will be working in a Kaggle Notebook.

## Importing the dataset

``````
import pandas as pd

df = pd.read_csv("/kaggle/input/house-rent-prediction-dataset/House_Rent_Dataset.csv")
``````

We use the `read_csv` function from `pandas` (a Python data manipulation library) to read the CSV file and store its contents as a `DataFrame` (a table), here named `df`; you can choose any name.
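As a minimal sketch of what `read_csv` returns, the snippet below parses a small in-memory CSV instead of the Kaggle file. The rows are made-up toy values, not taken from the dataset, and the name `toy_df` is only used for illustration.

```python
import io

import pandas as pd

# Made-up rows that mimic a few columns of the dataset (not real data).
csv_text = """BHK,Rent,Size,City
2,10000,1100,Kolkata
1,7500,800,Mumbai
3,20000,1500,Delhi
"""

# read_csv accepts a file path or any file-like object.
toy_df = pd.read_csv(io.StringIO(csv_text))

print(type(toy_df).__name__)  # DataFrame
print(toy_df.shape)           # (3, 4)
print(list(toy_df.columns))   # ['BHK', 'Rent', 'Size', 'City']
```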

## Exploring the dataset

Now that we have imported the data, we should see what we are working with. To get an overview of your `DataFrame`, you can simply run a cell containing its name, as done below:

``````
df
``````

Running the cell displays an overview of your DataFrame: the first and last few rows, along with the total number of rows and columns.

Now, let us focus on the various columns and their data type. To do that, we can perform the following:

``````
df.info()
``````

The `.info()` method summarizes the DataFrame: the column names, the number of non-null values and the data type of each column, a count of columns per data type, and the memory usage.
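Besides reading the `.info()` output by eye, the non-numeric columns can be listed programmatically. The following is a small sketch on a made-up DataFrame (the name `toy` and its values are illustrative assumptions, not the real dataset):

```python
import pandas as pd

# Made-up rows, not taken from the dataset.
toy = pd.DataFrame({
    "Size": [1100, 800, 1500],
    "City": ["Kolkata", "Mumbai", "Delhi"],
    "Rent": [10000, 7500, 20000],
})

# Columns that are not numeric will need encoding before modelling.
text_columns = toy.select_dtypes(exclude="number").columns.tolist()
print(text_columns)  # ['City']
```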

Now, we have to choose which features to use. There are feature selection methods for identifying the important features, but to keep the post simple, we will pick the features we feel are important: `BHK`, `Size`, `Area Type`, `City`, `Furnishing Status`, and `Bathroom` as the independent variables, and `Rent` as the dependent variable, i.e. the value we are trying to predict.

Let us now see the unique values in each of the independent variables.

``````
df['BHK'].unique()
# array([2, 1, 3, 6, 4, 5])

df['Size'].unique()
# array([1100,  800,...])

df['Area Type'].unique()
# array(['Super Area', 'Carpet Area', 'Built Area'], dtype=object)

df['City'].unique()
# array(['Kolkata', 'Mumbai', 'Bangalore', 'Delhi', 'Chennai', 'Hyderabad'], dtype=object)

df['Furnishing Status'].unique()
# array(['Unfurnished', 'Semi-Furnished', 'Furnished'], dtype=object)

df['Bathroom'].unique()
# array([ 2,  1,  3,  5,  4,  6,  7, 10])
``````

You can observe that we have textual data for Area Type, City, and Furnishing Status. Let us convert these to numerical categories as done below:

``````
# Assign the result back to the column; recent pandas versions warn about
# (and may ignore) inplace=True on a selected column such as df['City'].
df['Area Type'] = df['Area Type'].replace(['Super Area', 'Carpet Area', 'Built Area'], [0, 1, 2])
# 'Super Area': 0, 'Carpet Area': 1, 'Built Area': 2

df['City'] = df['City'].replace(['Kolkata', 'Mumbai', 'Bangalore', 'Delhi', 'Chennai', 'Hyderabad'], [0, 1, 2, 3, 4, 5])
# 'Kolkata': 0, 'Mumbai': 1, 'Bangalore': 2, 'Delhi': 3, 'Chennai': 4, 'Hyderabad': 5

df['Furnishing Status'] = df['Furnishing Status'].replace(['Unfurnished', 'Semi-Furnished', 'Furnished'], [0, 1, 2])
# 'Unfurnished': 0, 'Semi-Furnished': 1, 'Furnished': 2
``````
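A quick way to see how `.replace` behaves is to run it on a toy column. The sketch below uses a made-up Series (the names `status` and `encoded` are illustrative) and shows that the call returns a new Series, while the original keeps its text values until the result is stored somewhere.

```python
import pandas as pd

# Made-up values; dtype=object so the column holds plain Python strings.
status = pd.Series(["Unfurnished", "Furnished", "Semi-Furnished"], dtype=object)

# .replace returns a new Series; `status` itself is not modified.
encoded = status.replace(["Unfurnished", "Semi-Furnished", "Furnished"], [0, 1, 2])

print(status.tolist())   # ['Unfurnished', 'Furnished', 'Semi-Furnished']
print(encoded.tolist())  # [0, 2, 1]
```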

What we did here is use the `.replace` method, which takes the parameters `(values_to_be_replaced, new_values, optional: inplace=True)`. By default, `.replace` returns a new object and leaves the DataFrame unchanged, so you can assign differently encoded versions to different variables without modifying the original. To keep the change, either assign the result back to the column or pass `inplace=True`, which modifies the existing object; recent pandas versions warn when `inplace=True` is used on a selected column such as `df['City']`, so assigning back is the safer form. Let us now assign the features to the variables `x` and `y` as shown (note that `BHK`, although listed among the chosen features, is not included in `x`):

``````
x = df[["Size","Area Type","City","Furnishing Status","Bathroom"]]
y = df['Rent']
``````

Next, we need a way to check the performance of the model. To do so, we split our data into a training set and a testing set using the `train_test_split` function from `sklearn.model_selection`, as shown below:

``````
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
``````

Here, our source data is `x` and `y`; `test_size=0.2` holds out 20% of the data for testing, leaving 80% for training. Setting `random_state=42` makes the random split reproducible.
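To illustrate the effect of `test_size` and `random_state`, here is a small sketch on ten made-up samples (the arrays `X_toy` and `y_toy` are assumptions for the example). Two calls with the same seed should give identical splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten toy samples with one feature each (illustrative values only).
X_toy = np.arange(10).reshape(-1, 1)
y_toy = np.arange(10) * 100

# Two calls with the same random_state produce the same split.
X_tr1, X_te1, y_tr1, y_te1 = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

print(len(X_tr1), len(X_te1))        # 8 2
print(np.array_equal(X_te1, X_te2))  # True
```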

## Visualization

Now, we shall visualize the relationship between each chosen feature and rent. Since `Size` is continuous, we use a scatter plot for it; since the other features are categorical, we use bar plots to visualize their relationship with `Rent`.

``````
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 12))

plt.subplot(3, 2, 1)
plt.scatter(df["Size"], df['Rent'])
plt.title('Size vs Rent')

plt.subplot(3, 2, 2)
plt.bar(df["Area Type"], df['Rent'])
plt.title('Area Type vs Rent')

plt.subplot(3, 2, 3)
plt.bar(df["City"], df['Rent'])
plt.title('City vs Rent')

plt.subplot(3, 2, 4)
plt.bar(df["Furnishing Status"], df['Rent'])
plt.title('Furnishing Status vs Rent')

plt.subplot(3, 2, 5)
plt.bar(df["Bathroom"], df['Rent'])
plt.title('Bathroom vs Rent')

plt.tight_layout()
plt.show()

``````

In the above code, we create a new figure of size 12 x 12 inches with `plt.figure(figsize=(12, 12))`. `plt.subplot(3, 2, i)` selects the i-th cell of a 3 x 2 grid, in which we draw a scatter plot of `Size` vs. `Rent` and bar plots of the remaining features vs. `Rent`. We then call `plt.tight_layout()` to prevent the subplots from overlapping.
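One thing to keep in mind with a call like `plt.bar(df["City"], df['Rent'])` is that it draws one bar per row, so bars that share a category overlap and the visible height is the largest rent in that category. If a typical rent per category is of more interest, one option is to aggregate first. The sketch below uses made-up values (the DataFrame `toy` is an assumption for illustration) and only computes the numbers; the plotting call is left as a comment.

```python
import pandas as pd

# Made-up rents for two encoded cities (0 and 1); not real data.
toy = pd.DataFrame({
    "City": [0, 0, 0, 1, 1],
    "Rent": [10000, 12000, 50000, 30000, 34000],
})

max_rent = toy.groupby("City")["Rent"].max()    # what overlapping bars end up showing
mean_rent = toy.groupby("City")["Rent"].mean()  # a typical rent per category

print(max_rent.tolist())   # [50000, 34000]
print(mean_rent.tolist())  # [24000.0, 32000.0]

# To plot the means: plt.bar(mean_rent.index, mean_rent.values)
```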

## Model Creation

Linear regression is a technique used to model the relationships between observed variables. The idea behind simple linear regression is to "fit" the observations of two variables into a linear relationship between them. ~Brilliant.org

Polynomial regression is an extension of a standard linear regression model. Polynomial regression models the non-linear relationship between a predictor and an outcome variable. ~builtin.com

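A regression model fits a line (or, for polynomial regression, a curve) that passes as close as possible to the data points. It does so by least squares: it chooses the coefficients that minimize the sum of squared differences between each observed value and the corresponding value on the regression line.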
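For a single feature, the least-squares line has a closed-form solution, which the sketch below computes by hand on made-up points. The function name `fit_line` is hypothetical and not part of any library; it is only meant to show the idea, not how `sklearn` implements it.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = sum of co-deviations of x and y / sum of squared deviations of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up sizes and rents that lie exactly on rent = 10 * size + 2000.
sizes = [500, 800, 1000, 1200]
rents = [7000, 10000, 12000, 14000]

slope, intercept = fit_line(sizes, rents)
print(slope, intercept)  # 10.0 2000.0
```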

### Linear Regression

We will utilize the `LinearRegression` model from `sklearn.linear_model` as shown below and train the model on the training set using the `fit` method:

``````
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(x_train,y_train)
``````
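To get a feel for what `fit` learns, the following sketch trains `LinearRegression` on a tiny made-up dataset generated from a known rule, `y = 3*a + 2*b + 5`, and then inspects the fitted coefficients. The arrays and the name `toy_model` are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data that follows y = 3*a + 2*b + 5 exactly.
X_toy = np.array([[1, 2], [2, 1], [3, 5], [4, 3]])
y_toy = 3 * X_toy[:, 0] + 2 * X_toy[:, 1] + 5

toy_model = LinearRegression()
toy_model.fit(X_toy, y_toy)

# The learned coefficients and intercept should recover the rule (up to rounding).
print(np.round(toy_model.coef_, 6))             # approximately [3. 2.]
print(round(float(toy_model.intercept_), 6))    # approximately 5.0
prediction = toy_model.predict(np.array([[5, 5]]))
print(np.round(prediction, 6))                  # approximately [30.]
```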

Now that our linear model is trained, let us predict the values for `x_test` and assign the results to the variable `y_pred`. We then compute the mean absolute error (MAE) between the predictions and the actual rent values using `mean_absolute_error` from `sklearn.metrics`, as shown below:

``````
from sklearn.metrics import mean_absolute_error

y_pred = model.predict(x_test)
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error: {mae}")
# Mean Absolute Error: 24368.853098314095
``````
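The mean absolute error is the average of the absolute differences between the actual and predicted values, so the result above means the predictions are off by roughly 24,369 rent units on average. A hand-written sketch on made-up numbers (the function name `mean_absolute_error_manual` is hypothetical):

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average of |actual - predicted| over all samples."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up rents and predictions.
actual = [10000, 20000, 30000]
predicted = [12000, 16000, 30000]

toy_mae = mean_absolute_error_manual(actual, predicted)
print(toy_mae)  # 2000.0  -> (2000 + 4000 + 0) / 3
```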

### Polynomial Regression

To create a polynomial regression model, we first expand the features into polynomial terms up to degree `n` (here `n = 3`) using `PolynomialFeatures` from `sklearn.preprocessing`, as shown below:

``````
from sklearn.preprocessing import PolynomialFeatures

poly_features = PolynomialFeatures(degree=3)
X_poly = poly_features.fit_transform(x)
``````
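To see what `PolynomialFeatures` produces, here is a small sketch on a single made-up row with two features, `a = 2` and `b = 3`, using `degree=2` to keep the output short. The last lines count the columns generated for five features with `degree=3`, the setting used above.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One made-up row with two features: a = 2, b = 3.
row = np.array([[2, 3]])

poly = PolynomialFeatures(degree=2)
expanded = poly.fit_transform(row)

# Columns are: 1, a, b, a^2, a*b, b^2
print(expanded)  # [[1. 2. 3. 4. 6. 9.]]

# Five input features expanded to degree 3 (values are irrelevant for the count).
n_cols = PolynomialFeatures(degree=3).fit_transform(np.zeros((1, 5))).shape[1]
print(n_cols)  # 56
```

The model fitted on these columns is still an ordinary `LinearRegression`; the non-linearity comes entirely from the added power and interaction terms.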

We will again split the new data into training and testing sets, fit the Linear Regression model on the polynomial data, and predict the values for `x_test`:

``````
x_train, x_test, y_train, y_test = train_test_split(X_poly, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(x_train, y_train)
y_pred=model.predict(x_test)
``````

Finally, to test the performance of the model, we again compute the mean absolute error between the predictions and the values of `y_test`.

``````
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error: {mae}")
# Mean Absolute Error: 16310.627544351808
``````

I hope this post was helpful in understanding how Regression works as well as a bit of data cleaning and visualization. Feel free to ask me any questions that you have. I will try my best to answer them. If you have any feedback for the post as well, feel free to let me know.

Other common performance measures for regression models are:

• Mean Squared Error
• Root Mean Squared Error
• R2 Score
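As a rough sketch of how these three measures are defined, the snippet below computes them by hand on made-up values. The function name `regression_metrics` is hypothetical; in practice `sklearn.metrics` offers `mean_squared_error` and `r2_score` for the same purpose.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MSE, RMSE, R2) for two equal-length sequences."""
    n = len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # squared errors
    mse = ss_res / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # spread of the actual values
    r2 = 1 - ss_res / ss_tot
    return mse, rmse, r2


# Made-up rents and predictions; only the last prediction is off (by 2000).
actual = [1000, 3000, 5000, 7000]
predicted = [1000, 3000, 5000, 9000]

mse, rmse, r2 = regression_metrics(actual, predicted)
print(mse, rmse, r2)  # 1000000.0 1000.0 0.8
```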

To learn the maths behind regression and the performance measures, here are a few sources:

Regression Theory

Performance Metrics