House_Price_Prediction

In the world of real estate, determining property prices involves numerous factors, from location and size to amenities and market trends. Linear regression, a foundational technique in machine learning, provides a practical way to predict housing prices based on key features like the number of rooms or square footage.

In this article, I walk through applying linear regression to a housing dataset, from data preprocessing and feature engineering to building a model that can offer valuable price insights, and then compare it with a random forest. Whether you’re new to data science or seeking to deepen your understanding, this project serves as a hands-on exploration of how data-driven predictions can shape smarter real estate decisions.

First things first, you start by importing your libraries:

import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
#Read from the directory where you stored the data

data  = pd.read_csv('/kaggle/input/california-housing-prices/housing.csv')
data

#Check for null values by comparing each column's non-null count with the total number of rows
data.info()

#Drop the rows with null values so that every column has the same number of non-null entries
data.dropna(inplace = True)
data.info()
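
A quicker way to see exactly how many values each column is missing is to count them directly. The following is a minimal sketch on a small made-up DataFrame (the `toy` frame and its values are hypothetical stand-ins for the housing data):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the housing data; values are made up for illustration
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
})

null_counts = toy.isna().sum()  # number of missing values per column
cleaned = toy.dropna()          # drops every row that contains at least one NaN

print(null_counts)
print(len(toy), "rows before,", len(cleaned), "rows after dropna()")
```

Note that `dropna()` removes whole rows, so a single missing value in one column costs you the entire record.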

#Separate the features (X) from the target (y) before splitting into training and test sets

from sklearn.model_selection import train_test_split

X = data.drop(['median_house_value'], axis = 1)
y = data['median_house_value']
y

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
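
The split above is random, so the rows that land in the test set (and every score further down) will change from run to run. If you want reproducible numbers, `train_test_split` accepts a `random_state` seed. A small sketch on made-up arrays, where the seed value 42 is an arbitrary choice for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(20).reshape(10, 2)  # 10 hypothetical samples with 2 features each
y_toy = np.arange(10)

# Same seed on both calls, so both calls produce the same split
a_train, a_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
b_train, b_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

print(a_train.shape, a_test.shape)
print(np.array_equal(a_test, b_test))
```
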
#Join X_train and y_train so we can examine how the features correlate with the target
train_data = X_train.join(y_train)
train_data

#Visualizing the above
train_data.hist(figsize=(15, 8))

#One-hot encode the non-numeric column so it can be included in the correlation analysis

train_data_encoded = pd.get_dummies(train_data, drop_first=True)
correlation_matrix = train_data_encoded.corr()
print(correlation_matrix)

train_data_encoded.corr()

plt.figure(figsize=(15,8))
sns.heatmap(train_data_encoded.corr(), annot=True, cmap = "inferno")

#Log-transform the skewed count columns so their distributions look closer to a bell curve
train_data['total_rooms'] = np.log(train_data['total_rooms'] + 1)
train_data['total_bedrooms'] = np.log(train_data['total_bedrooms'] +1)
train_data['population'] = np.log(train_data['population'] + 1)
train_data['households'] = np.log(train_data['households'] + 1)
train_data.hist(figsize=(15, 8))
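
The `+ 1` inside `np.log(...)` keeps a value of zero from turning into negative infinity. NumPy also ships `np.log1p`, which computes the same thing in one call. A quick check on a made-up column (the `toy` frame is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical population values, including a zero to show why the +1 matters
toy = pd.DataFrame({"population": [0.0, 322.0, 2401.0, 35682.0]})

manual = np.log(toy["population"] + 1)  # the form used in this post
builtin = np.log1p(toy["population"])   # NumPy's equivalent helper

print(np.allclose(manual, builtin))
print(builtin)
```

Either form works here; `np.log1p` is mainly a readability choice and is more accurate for values very close to zero.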

#Convert the ocean_proximity categories into binary columns using one-hot encoding
train_data.ocean_proximity.value_counts()

ocean_proximity
<1H OCEAN 7267
INLAND 5183
NEAR OCEAN 2108
NEAR BAY 1783
ISLAND 5
Name: count, dtype: int64

#Create a binary (0 or 1) column for each of the categories above
pd.get_dummies(train_data.ocean_proximity)

#Join the binary columns to the data, then drop the original ocean_proximity column
train_data = train_data.join(pd.get_dummies(train_data.ocean_proximity)).drop(['ocean_proximity'], axis=1)
train_data
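
To make the join-then-drop step easier to follow, here is the same pattern on a four-row made-up frame (the `toy` data is hypothetical, and `dtype=int` is my own addition so the output prints as 0/1 rather than True/False):

```python
import pandas as pd

# Hypothetical rows; only the column names are borrowed from the housing data
toy = pd.DataFrame({
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"],
    "households": [126.0, 1138.0, 177.0, 219.0],
})

# One new column per category, 1 where the row belongs to that category
dummies = pd.get_dummies(toy["ocean_proximity"], dtype=int)

# Attach the new columns and remove the original text column
encoded = toy.join(dummies).drop(columns=["ocean_proximity"])
print(encoded)
```

Each row ends up with exactly one 1 across the new columns, which is what lets a numeric model use the category.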

#recheck for correlation
plt.figure(figsize=(18, 8))
sns.heatmap(train_data.corr(), annot=True, cmap ='twilight')

#visualize the coordinates
plt.figure(figsize=(15, 8))
sns.scatterplot(x='latitude', 
                y = 'longitude',
                data= train_data, 
                hue='median_house_value', palette='Spectral')

#Feature engineering: derive new features from the ones we already have

train_data['bedroom_ratio'] = train_data['total_bedrooms']/train_data['total_rooms']
train_data['household_rooms'] = train_data['total_rooms']/train_data['households']
#show correlation
plt.figure(figsize=(18, 8))
sns.heatmap(train_data.corr(), annot=True, cmap ='ocean')
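
A large heatmap can be hard to read when you only care about one column. One option is to pull out the target's column from the correlation matrix and sort it. A sketch on made-up numbers (the `toy` frame is hypothetical, with values chosen so the signs are easy to verify by hand):

```python
import pandas as pd

# Hypothetical values: median_income rises with price, latitude mostly falls
toy = pd.DataFrame({
    "median_income": [1, 2, 3, 4, 5],
    "latitude": [5, 3, 4, 1, 2],
    "median_house_value": [10, 20, 30, 40, 50],
})

corr_with_target = (
    toy.corr()["median_house_value"]  # correlations of every column with the target
    .drop("median_house_value")       # the target's correlation with itself is always 1
    .sort_values(ascending=False)     # strongest positive first
)
print(corr_with_target)
```

On the real training data the same three lines would work with `train_data` in place of `toy`.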

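#Train a linear regression model on the scaled training data
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

X_train, y_train = train_data.drop(['median_house_value'], axis=1), train_data['median_house_value']

#Standardize the features; the fitted scaler is reused on the test set later
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)

reg = LinearRegression()

reg.fit(X_train_s, y_train)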
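
The later cells score the model on scaled inputs (`X_test_s`), so the scaler and the model have to be fitted on the same training data. One way to keep the two from drifting apart is to bundle them in a scikit-learn pipeline. This is a sketch on synthetic data, not the housing set; the array names and coefficients are made up:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with an exact linear relationship, for illustration only
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5]) + 3.0

# The pipeline fits the scaler on the training rows only,
# then applies that same scaling whenever it predicts or scores
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_toy[:80], y_toy[:80])

r2 = model.score(X_toy[80:], y_toy[80:])
print(r2)
```

With a pipeline there is no separate `X_train_s` or `X_test_s` to keep track of, which removes one common source of mismatched inputs.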

#Join X_test and y_test to form test_data (X_test still contains 'ocean_proximity')
test_data = X_test.join(y_test)

#Apply the same log transformations as on the training data
test_data['total_rooms'] = np.log(test_data['total_rooms'] + 1)
test_data['total_bedrooms'] = np.log(test_data['total_bedrooms'] + 1)
test_data['population'] = np.log(test_data['population'] + 1)
test_data['households'] = np.log(test_data['households'] + 1)

#One-hot encode 'ocean_proximity' without a prefix, exactly as was done for the training data
test_data = test_data.join(pd.get_dummies(test_data.ocean_proximity)).drop(['ocean_proximity'], axis=1)

#Create the same engineered feature columns
test_data['bedroom_ratio'] = test_data['total_bedrooms']/test_data['total_rooms']
test_data['household_rooms'] = test_data['total_rooms']/test_data['households']

#Add any category column missing from the test split (e.g. 'ISLAND') and match the training column order
test_data = test_data.reindex(columns=train_data.columns, fill_value=0)
X_test, y_test = test_data.drop(['median_house_value'], axis=1), test_data['median_house_value']
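
One subtlety when the training and test sets are encoded separately: `ISLAND` appears only 5 times in the training counts shown earlier, so a test split can easily contain no `ISLAND` rows at all. In that case `pd.get_dummies` never creates the column, and the test features no longer line up with what the model was trained on. A sketch of one way to align them, using hypothetical column lists and values:

```python
import pandas as pd

# Hypothetical list of columns the model saw during training
train_cols = ["households", "<1H OCEAN", "INLAND", "ISLAND"]

# Hypothetical encoded test rows: no 'ISLAND' column, and a different column order
test_toy = pd.DataFrame({
    "households": [5.1, 6.3],
    "INLAND": [1, 0],
    "<1H OCEAN": [0, 1],
})

# reindex adds the missing column filled with 0 and reorders to match training
aligned = test_toy.reindex(columns=train_cols, fill_value=0)
print(aligned)
```
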
X_test_s = scaler.transform(X_test)
reg.score(X_test_s, y_test)

0.5092972905670141
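
For a regressor, `.score()` returns R², the share of the target's variance that the model explains. It can help to look at an error in the target's own units as well, such as the root mean squared error. A small worked example on made-up numbers (both arrays are hypothetical):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true and predicted prices, each prediction off by exactly 10
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 390.0])

r2 = r2_score(y_true, y_pred)                       # same quantity .score() reports
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # typical error, in price units

print(r2, rmse)
```

Here every prediction misses by 10, so the RMSE is 10, while R² comes out at 0.992 because those misses are small relative to the spread of the true values.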

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

forest = RandomForestRegressor()

forest.fit(X_train_s, y_train)

forest.score(X_test_s, y_test)

0.4447616558596853

from sklearn.model_selection import GridSearchCV

param_grid ={
     'n_estimators': [3, 10, 30],
     'max_features' : [2, 4, 6, 8]

}

grid_search = GridSearchCV(forest, param_grid,
                           cv=5,
                           scoring ="neg_mean_squared_error",
                          return_train_score=True)
grid_search.fit(X_train_s, y_train)


grid_search.best_estimator_

grid_search.best_estimator_.score(X_test_s, y_test)

0.5384474921332503

Training a machine learning model is not the easiest of processes, but to keep improving the results above you can add more hyperparameters to param_grid, such as min_samples_split or max_depth, and check whether the best estimator's score improves.
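
Extending the grid might look like the sketch below. It runs on small synthetic data so it finishes quickly. The extra hyperparameter `min_samples_split` is a real `RandomForestRegressor` parameter, but choosing it here (and the candidate values) is my assumption, not something taken from the run above:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data for illustration only
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 4))
y_toy = X_toy[:, 0] * 3.0 + rng.normal(scale=0.1, size=60)

# Hypothetical grid: every combination of these values is tried
param_grid = {
    "n_estimators": [5, 10],
    "min_samples_split": [2, 4],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X_toy, y_toy)

# best_score_ is a negated MSE, so flip the sign before taking the square root
best_rmse = np.sqrt(-search.best_score_)
print(search.best_params_, best_rmse)
```

Keep in mind that each added value multiplies the number of models to train, so the grid gets expensive quickly on the full dataset.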

If you made it this far, please like the post and share your thoughts in the comments below. Your opinion really matters. Thank you!😊🥰❤️
