What is Regression?
Regression is a statistical method used in finance, investing, and other disciplines that attempts to determine the strength and character of the relationship between one dependent variable (usually denoted by Y) and a series of other variables (known as independent variables). ~investopedia.com
In this post we will be using a House Rent Prediction Dataset from Kaggle to practically understand how Regression works. I will be working in a Kaggle Notebook
Importing the dataset
import pandas as pd

df = pd.read_csv("/kaggle/input/house-rent-prediction-dataset/House_Rent_Dataset.csv")
We use the `read_csv` method from Pandas (a Python data-manipulation library) to read the data from the CSV file and store it as a DataFrame (essentially a table) called `df`, or any name you provide.
Exploring the dataset
Now that we have imported the data, we should see what we are working with. If you want an overview of your DataFrame, you can simply run a cell with its name, as done below:
df
Now you will see an overview of your Dataframe as shown below
Now, let us focus on the various columns and their data type. To do that, we can perform the following:
df.info()
The `.info()` method provides information about the DataFrame such as the columns, the data type of each column, the non-null counts, and so on, as shown below:
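As a tiny illustration (using a made-up two-row frame, not the rent dataset), here is the kind of per-column detail this inspection gives you:

```python
import pandas as pd

# Toy frame (made-up values) just to show what .info() reports
toy = pd.DataFrame({"Rent": [10000, 15000], "City": ["Kolkata", "Mumbai"]})
toy.info()  # prints row count, column names, non-null counts, and dtypes

# .dtypes exposes the same type information as a Series you can use in code
print(toy.dtypes["Rent"], toy.dtypes["City"])  # int64 object
```

Numeric columns show up as `int64`/`float64`, while text columns show up as `object`, which is our cue to encode them before modeling.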
Now, we have to choose which features to use. There are techniques such as feature selection to pick out the most important features, but to keep this post simple we will work with the features we feel are important: `BHK`, `Size`, `Area Type`, `City`, `Furnishing Status`, and `Bathroom` as the independent variables, and `Rent` as the dependent variable, i.e. the value we are trying to predict.
Let us now see the unique values in each of the independent variables.
df['BHK'].unique()
# array([2, 1, 3, 6, 4, 5])
df['Size'].unique()
# array([1100, 800,...])
df['Area Type'].unique()
# array(['Super Area', 'Carpet Area', 'Built Area'], dtype=object)
df['City'].unique()
# array(['Kolkata', 'Mumbai', 'Bangalore', 'Delhi', 'Chennai', 'Hyderabad'], dtype=object)
df['Furnishing Status'].unique()
# array(['Unfurnished', 'Semi-Furnished', 'Furnished'], dtype=object)
df['Bathroom'].unique()
#array([ 2, 1, 3, 5, 4, 6, 7, 10])
You can observe that we have textual data for Area Type, City, and Furnishing Status. Let us convert these to numerical categories as done below:
df['Area Type'].replace(['Super Area', 'Carpet Area', 'Built Area'], [0, 1, 2], inplace=True)
# 'Super Area': 0, 'Carpet Area': 1, 'Built Area': 2
df['City'].replace(['Kolkata', 'Mumbai', 'Bangalore', 'Delhi', 'Chennai', 'Hyderabad'], [0, 1, 2, 3, 4, 5], inplace=True)
# 'Kolkata': 0, 'Mumbai': 1, 'Bangalore': 2, 'Delhi': 3, 'Chennai': 4, 'Hyderabad': 5
df['Furnishing Status'].replace(['Unfurnished', 'Semi-Furnished', 'Furnished'], [0, 1, 2], inplace=True)
# 'Unfurnished':0, 'Semi-Furnished':1, 'Furnished:2'
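To see the same idea in isolation, here is a small self-contained sketch on made-up data. It also shows a dict-based mapping, which some find clearer than the parallel-lists form used above:

```python
import pandas as pd

# Toy column (made-up values) to illustrate how .replace maps text to numbers
toy = pd.DataFrame({"Furnishing Status": ["Unfurnished", "Furnished", "Semi-Furnished"]})

# A dict spells out {old_value: new_value} explicitly
mapping = {"Unfurnished": 0, "Semi-Furnished": 1, "Furnished": 2}
toy["Furnishing Status"] = toy["Furnishing Status"].replace(mapping)

print(toy["Furnishing Status"].tolist())  # [0, 2, 1]
```

Note that this encoding imposes an arbitrary order on the categories; for linear models, one-hot encoding (`pd.get_dummies`) is often a better choice, but integer codes keep this walkthrough simple.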
What we did here is use the `.replace` method, which takes the parameters `(values_to_be_replaced, new_values, inplace=True)`. Passing `inplace=True` permanently replaces the values in the DataFrame. If you omit `inplace` or set it to `False`, `.replace` instead returns a new copy with the replaced values, so you can assign different versions of your DataFrame to different variables without modifying the original. Let us now assign the features to the variables `x` and `y` as shown:
x = df[["BHK","Size","Area Type","City","Furnishing Status","Bathroom"]]
y = df['Rent']
Now, we need a way to check the performance of the model, so we will split our data into a training set and a testing set. This is done using the `train_test_split` function from `sklearn.model_selection`, as shown below:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
Here, our source data is `x` and `y`; `test_size=0.2` holds out 20% of the data for testing, meaning 80% is used for training. Setting `random_state=42` ensures that the random split the method performs is reproducible.
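A quick sanity check on toy data (made-up values, not the rent dataset) confirms both the 80/20 proportions and the reproducibility:

```python
from sklearn.model_selection import train_test_split

# 10 toy samples; test_size=0.2 should give 8 training and 2 testing rows
X = [[i] for i in range(10)]
y = list(range(10))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
print(len(X_tr), len(X_te))  # 8 2

# The same random_state yields exactly the same split every run
X_tr2, X_te2, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_te == X_te2)  # True
```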
Visualization
Now, we shall visualize the relationship between each chosen feature and `Rent`. Since `Size` is continuous, we will use a scatter plot for it, and since the other features are categorical, we will use bar plots.
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 12))
plt.subplot(3, 2, 1)
plt.scatter(df["Size"], df['Rent'])
plt.title('Size vs Rent')
plt.subplot(3, 2, 2)
plt.bar(df["Area Type"], df['Rent'])
plt.title('Area Type vs Rent')
plt.subplot(3, 2, 3)
plt.bar(df["City"], df['Rent'])
plt.title('City vs Rent')
plt.subplot(3, 2, 4)
plt.bar(df["Furnishing Status"], df['Rent'])
plt.title('Furnishing Status vs Rent')
plt.subplot(3, 2, 5)
plt.bar(df["Bathroom"], df['Rent'])
plt.title('Bathroom vs Rent')
plt.tight_layout()
plt.show()
In the above code, we create a new figure of size 12 x 12 inches with `plt.figure(figsize=(12, 12))`. We then create a scatter plot of Size vs. Rent and bar plots of the remaining features vs. Rent, and finally call `plt.tight_layout()` to prevent the subplots from overlapping.
Model Creation
Linear regression is a technique used to model the relationships between observed variables. The idea behind simple linear regression is to "fit" the observations of two variables into a linear relationship between them. ~Brilliant.org
Polynomial regression is an extension of a standard linear regression model. Polynomial regression models the non-linear relationship between a predictor and an outcome variable ~ builtin.com
Regression models try to fit a line through the data points by minimizing the sum of squared differences (the least-squares criterion) between the values predicted along the regression line and the actual data points.
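The least-squares idea can be worked out by hand for the simple one-feature case: the best-fit slope is the covariance of x and y divided by the variance of x. Here is a sketch on tiny made-up data that lies exactly on the line y = 2x + 1, so the closed form should recover those coefficients:

```python
import numpy as np

# Tiny made-up data generated from y = 2x + 1 (no noise)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x + 1

# Closed-form simple linear regression: slope = cov(x, y) / var(x)
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()
print(slope, intercept)  # 2.0 1.0
```

`LinearRegression` below does the multi-feature generalization of exactly this computation.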
Linear Regression
We will utilize the `LinearRegression` model from `sklearn.linear_model` as shown below, and train it on the training set using the `fit` method:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x_train,y_train)
Now that our linear model is trained, let us predict values for `x_test` and assign the results to the variable `y_pred`, after which we will compute the mean absolute error (MAE) between the predictions and the actual rent values using `mean_absolute_error` from `sklearn.metrics`, as shown below:
from sklearn.metrics import mean_absolute_error

y_pred = model.predict(x_test)
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error: {mae}")
# Mean Absolute Error: 24368.853098314095
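MAE is simply the average of the absolute gaps between predicted and true values, which is easy to verify on made-up numbers:

```python
from sklearn.metrics import mean_absolute_error

# Made-up rents: the errors are 2000, 2000, and 3000
y_true = [10000, 20000, 30000]
y_pred = [12000, 18000, 33000]

mae = mean_absolute_error(y_true, y_pred)
print(mae)  # (2000 + 2000 + 3000) / 3 ≈ 2333.33
```

Because MAE is in the same units as the target, an MAE of about 24,369 means our linear model's rent predictions are off by roughly that many rupees on average.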
Polynomial Regression
To create a polynomial regression model, we first expand the features up to degree n using `PolynomialFeatures` from `sklearn.preprocessing`, as shown below:
from sklearn.preprocessing import PolynomialFeatures
poly_features = PolynomialFeatures(degree=3)
X_poly = poly_features.fit_transform(x)
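To see what this expansion actually produces, here is a minimal sketch on a single made-up row with two features and degree 2 (our code above uses degree 3 on six features, which expands the same way, just with many more columns):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One toy row [a, b] = [2, 3]; degree=2 expands it to [1, a, b, a^2, a*b, b^2]
X = np.array([[2.0, 3.0]])
poly = PolynomialFeatures(degree=2)
Xp = poly.fit_transform(X)
print(Xp)  # [[1. 2. 3. 4. 6. 9.]]
```

The model that follows is still linear in these expanded columns, which is why we can keep using `LinearRegression`; the non-linearity lives entirely in the features.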
We will again split the new data into train and test sets, fit the linear regression model on the polynomial features, and predict the values for `x_test`:
x_train, x_test, y_train, y_test = train_test_split(X_poly, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(x_train, y_train)
y_pred=model.predict(x_test)
Finally, to test the performance of this model, we again compute the mean absolute error between the predictions and the values of `y_test`:
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error: {mae}")
# Mean Absolute Error: 16310.627544351808
I hope this post was helpful in understanding how Regression works as well as a bit of data cleaning and visualization. Feel free to ask me any questions that you have. I will try my best to answer them. If you have any feedback for the post as well, feel free to let me know.
The various other performance measures for the Regression Models are:
- Mean Squared Error
- Root Mean Squared Error
- R2 Score
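All three are available in `sklearn.metrics`; here is a small sketch on made-up values showing how each is computed:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up values: errors are 1, 0, and 2
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]

mse = mean_squared_error(y_true, y_pred)   # mean of squared errors: (1 + 0 + 4) / 3
rmse = np.sqrt(mse)                        # square root puts it back in target units
r2 = r2_score(y_true, y_pred)              # 1.0 means a perfect fit
print(mse, rmse, r2)
```

MSE and RMSE penalize large errors more heavily than MAE, while the R2 score tells you what fraction of the variance in the target the model explains.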
To learn the maths behind Regression and the performance measures, here are a few sources:
Regression Theory
Performance Metrics