What is Regression?
Regression is a statistical method used in finance, investing, and other disciplines that attempts to determine the strength and character of the relationship between one dependent variable (usually denoted by Y) and a series of other variables (known as independent variables). ~investopedia.com
In this post we will be using a House Rent Prediction Dataset from Kaggle to practically understand how Regression works. I will be working in a Kaggle Notebook
Importing the dataset
import pandas as pd

df = pd.read_csv("/kaggle/input/house-rent-prediction-dataset/House_Rent_Dataset.csv")
We use the `read_csv` method from Pandas (a Python data-manipulation library) to read the data from the CSV file and store it as a DataFrame (essentially a table) called `df`, or any name you provide.
Exploring the dataset
Now that we have imported the data, we should see what we are working with. If you want an overview of your DataFrame, you can simply run a cell with its name, as done below:
df
Now you will see an overview of your Dataframe as shown below
Now, let us focus on the various columns and their data type. To do that, we can perform the following:
df.info()
The `.info()` method provides information about the DataFrame such as the columns, the data type of each column, the non-null counts, and so on, as shown below:
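As a tiny illustration (using a made-up two-row frame, not the rent dataset), here is the kind of per-column detail this inspection gives you:

```python
import pandas as pd

# Toy frame (made-up values) just to show what .info() reports
toy = pd.DataFrame({"Rent": [10000, 15000], "City": ["Kolkata", "Mumbai"]})
toy.info()  # prints row count, column names, non-null counts, and dtypes

# .dtypes exposes the same type information as a Series you can use in code
print(toy.dtypes["Rent"], toy.dtypes["City"])  # int64 object
```

Numeric columns show up as `int64`/`float64`, while text columns show up as `object`, which is our cue to encode them before modeling.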
Now, we have to choose which features to use. There are techniques such as feature selection to pick out the most important features, but to keep this post simple we will work with the features we feel are important: `BHK`, `Size`, `Area Type`, `City`, `Furnishing Status`, and `Bathroom` as the independent variables, and `Rent` as the dependent variable, i.e. the value we are trying to predict.
Let us now see the unique values in each of the independent variables.
df['BHK'].unique()
# array([2, 1, 3, 6, 4, 5])
df['Size'].unique()
# array([1100, 800,...])
df['Area Type'].unique()
# array(['Super Area', 'Carpet Area', 'Built Area'], dtype=object)
df['City'].unique()
# array(['Kolkata', 'Mumbai', 'Bangalore', 'Delhi', 'Chennai', 'Hyderabad'], dtype=object)
df['Furnishing Status'].unique()
# array(['Unfurnished', 'Semi-Furnished', 'Furnished'], dtype=object)
df['Bathroom'].unique()
#array([ 2, 1, 3, 5, 4, 6, 7, 10])
You can observe that we have textual data for Area Type, City, and Furnishing Status. Let us convert these to numerical categories as done below:
df['Area Type'].replace(['Super Area', 'Carpet Area', 'Built Area'], [0, 1, 2], inplace=True)
# 'Super Area': 0, 'Carpet Area': 1, 'Built Area': 2
df['City'].replace(['Kolkata', 'Mumbai', 'Bangalore', 'Delhi', 'Chennai', 'Hyderabad'], [0, 1, 2, 3, 4, 5], inplace=True)
# 'Kolkata': 0, 'Mumbai': 1, 'Bangalore': 2, 'Delhi': 3, 'Chennai': 4, 'Hyderabad': 5
df['Furnishing Status'].replace(['Unfurnished', 'Semi-Furnished', 'Furnished'], [0, 1, 2], inplace=True)
# 'Unfurnished':0, 'Semi-Furnished':1, 'Furnished:2'
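To see the same idea in isolation, here is a small self-contained sketch on made-up data. It also shows a dict-based mapping, which some find clearer than the parallel-lists form used above:

```python
import pandas as pd

# Toy column (made-up values) to illustrate how .replace maps text to numbers
toy = pd.DataFrame({"Furnishing Status": ["Unfurnished", "Furnished", "Semi-Furnished"]})

# A dict spells out {old_value: new_value} explicitly
mapping = {"Unfurnished": 0, "Semi-Furnished": 1, "Furnished": 2}
toy["Furnishing Status"] = toy["Furnishing Status"].replace(mapping)

print(toy["Furnishing Status"].tolist())  # [0, 2, 1]
```

Note that this encoding imposes an arbitrary order on the categories; for linear models, one-hot encoding (`pd.get_dummies`) is often a better choice, but integer codes keep this walkthrough simple.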
What we did here is use the `.replace` method, which takes the parameters `(values_to_be_replaced, new_values, inplace=True)`. Passing `inplace=True` permanently replaces the values in the DataFrame. If you omit `inplace` or set it to `False`, `.replace` instead returns a new copy with the replaced values, so you can assign different versions of your DataFrame to different variables without modifying the original. Let us now assign the features to the variables `x` and `y` as shown:
x = df[["BHK","Size","Area Type","City","Furnishing Status","Bathroom"]]
y = df['Rent']
Now, we need a way to check the performance of the model, so we will split our data into a training set and a testing set. This is done using the `train_test_split` function from `sklearn.model_selection`, as shown below:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
Here, our source data is `x` and `y`; `test_size=0.2` holds out 20% of the data for testing, meaning 80% is used for training. Setting `random_state=42` ensures that the random split the method performs is reproducible.
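A quick sanity check on toy data (made-up values, not the rent dataset) confirms both the 80/20 proportions and the reproducibility:

```python
from sklearn.model_selection import train_test_split

# 10 toy samples; test_size=0.2 should give 8 training and 2 testing rows
X = [[i] for i in range(10)]
y = list(range(10))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
print(len(X_tr), len(X_te))  # 8 2

# The same random_state yields exactly the same split every run
X_tr2, X_te2, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_te == X_te2)  # True
```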
Visualization
Now, we shall visualize the relationship between each chosen feature and `Rent`. Since `Size` is continuous, we will use a scatter plot for it, and since the other features are categorical, we will use bar plots.
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 12))
plt.subplot(3, 2, 1)
plt.scatter(df["Size"], df['Rent'])
plt.title('Size vs Rent')
plt.subplot(3, 2, 2)
plt.bar(df["Area Type"], df['Rent'])
plt.title('Area Type vs Rent')
plt.subplot(3, 2, 3)
plt.bar(df["City"], df['Rent'])
plt.title('City vs Rent')
plt.subplot(3, 2, 4)
plt.bar(df["Furnishing Status"], df['Rent'])
plt.title('Furnishing Status vs Rent')
plt.subplot(3, 2, 5)
plt.bar(df["Bathroom"], df['Rent'])
plt.title('Bathroom vs Rent')
plt.tight_layout()
plt.show()
In the above code, we create a new figure of size 12 x 12 inches with `plt.figure(figsize=(12, 12))`. We then create a scatter plot of Size vs. Rent and bar plots of the remaining features vs. Rent, and finally call `plt.tight_layout()` to prevent the subplots from overlapping.
Model Creation
Linear regression is a technique used to model the relationships between observed variables. The idea behind simple linear regression is to "fit" the observations of two variables into a linear relationship between them. ~Brilliant.org
Polynomial regression is an extension of a standard linear regression model. Polynomial regression models the non-linear relationship between a predictor and an outcome variable ~ builtin.com
Regression models try to fit a line through the data points by minimizing the sum of squared differences (the least-squares criterion) between the values predicted along the regression line and the actual data points.
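The least-squares idea can be worked out by hand for the simple one-feature case: the best-fit slope is the covariance of x and y divided by the variance of x. Here is a sketch on tiny made-up data that lies exactly on the line y = 2x + 1, so the closed form should recover those coefficients:

```python
import numpy as np

# Tiny made-up data generated from y = 2x + 1 (no noise)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x + 1

# Closed-form simple linear regression: slope = cov(x, y) / var(x)
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()
print(slope, intercept)  # 2.0 1.0
```

`LinearRegression` below does the multi-feature generalization of exactly this computation.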
Linear Regression
We will utilize the `LinearRegression` model from `sklearn.linear_model` as shown below, and train it on the training set using the `fit` method:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x_train,y_train)
Now that our linear model is trained, let us predict values for `x_test` and assign the results to the variable `y_pred`, after which we will compute the mean absolute error (MAE) between the predictions and the actual rent values using `mean_absolute_error` from `sklearn.metrics`, as shown below:
from sklearn.metrics import mean_absolute_error

y_pred = model.predict(x_test)
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error: {mae}")
# Mean Absolute Error: 24368.853098314095
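MAE is simply the average of the absolute gaps between predicted and true values, which is easy to verify on made-up numbers:

```python
from sklearn.metrics import mean_absolute_error

# Made-up rents: the errors are 2000, 2000, and 3000
y_true = [10000, 20000, 30000]
y_pred = [12000, 18000, 33000]

mae = mean_absolute_error(y_true, y_pred)
print(mae)  # (2000 + 2000 + 3000) / 3 ≈ 2333.33
```

Because MAE is in the same units as the target, an MAE of about 24,369 means our linear model's rent predictions are off by roughly that many rupees on average.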
Polynomial Regression
To create a polynomial regression model, we first expand the features up to degree n using `PolynomialFeatures` from `sklearn.preprocessing`, as shown below:
from sklearn.preprocessing import PolynomialFeatures
poly_features = PolynomialFeatures(degree=3)
X_poly = poly_features.fit_transform(x)
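To see what this expansion actually produces, here is a minimal sketch on a single made-up row with two features and degree 2 (our code above uses degree 3 on six features, which expands the same way, just with many more columns):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One toy row [a, b] = [2, 3]; degree=2 expands it to [1, a, b, a^2, a*b, b^2]
X = np.array([[2.0, 3.0]])
poly = PolynomialFeatures(degree=2)
Xp = poly.fit_transform(X)
print(Xp)  # [[1. 2. 3. 4. 6. 9.]]
```

The model that follows is still linear in these expanded columns, which is why we can keep using `LinearRegression`; the non-linearity lives entirely in the features.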
We will again split the new data into train and test sets, fit the linear regression model on the polynomial features, and predict the values for `x_test`:
x_train, x_test, y_train, y_test = train_test_split(X_poly, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(x_train, y_train)
y_pred=model.predict(x_test)
Finally, to test the performance of this model, we again compute the mean absolute error between the predictions and the values of `y_test`:
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error: {mae}")
# Mean Absolute Error: 16310.627544351808
I hope this post was helpful in understanding how Regression works as well as a bit of data cleaning and visualization. Feel free to ask me any questions that you have. I will try my best to answer them. If you have any feedback for the post as well, feel free to let me know.
The various other performance measures for the Regression Models are:
- Mean Squared Error
- Root Mean Squared Error
- R2 Score
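All three are available in `sklearn.metrics`; here is a small sketch on made-up values showing how each is computed:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up values: errors are 1, 0, and 2
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]

mse = mean_squared_error(y_true, y_pred)   # mean of squared errors: (1 + 0 + 4) / 3
rmse = np.sqrt(mse)                        # square root puts it back in target units
r2 = r2_score(y_true, y_pred)              # 1.0 means a perfect fit
print(mse, rmse, r2)
```

MSE and RMSE penalize large errors more heavily than MAE, while the R2 score tells you what fraction of the variance in the target the model explains.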
To learn the maths behind Regression and the performance measures, here are a few sources:
Regression Theory
Performance Metrics