Table of Contents
- Introduction
- Dataset
- Visualizing the Data
- Splitting Data: Features and Target
- Splitting Data: Training and Testing
- Implementing Linear Regression
- Making Predictions
- Wrapping Up
- What’s Next?
Introduction
Are you curious about predicting house rent based on factors like area? Let's walk through building a simple predictive model using Python! We’ll use the House Rent Prediction Dataset from Kaggle and tools like Google Colab, Pandas, NumPy, and Matplotlib. For machine learning, we’ll leverage Scikit-Learn.
Dataset
We’ll use the dataset from Kaggle: House Rent Prediction. Download the .xlsx file and load it with pandas:
import pandas as pd
# Load the Excel file
df = pd.read_excel("Rent.xlsx")
Explanation:
- The pd.read_excel() function from the pandas library reads data from an Excel file.
- Parsing Data: The function parses the data from the file and creates a pandas DataFrame object.
- DataFrame (df): This variable now holds the data from the Excel file in a structured format that can be easily manipulated and analyzed.
# Preview the data
df.head()
Explanation:
- .head(): Displays the first 5 rows of the DataFrame by default. This is a quick way to preview the structure of the data, including its column names and a sample of the actual values. It does not list the column datatypes; df.dtypes or df.info() shows those.
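To check column types alongside the preview, df.dtypes can be used next to .head(). Here is a minimal sketch on a small made-up DataFrame; the `sample` name and its values are assumptions for illustration, not rows from the Kaggle file:

```python
import pandas as pd

# Hypothetical stand-in for the real dataset; the values are invented
sample = pd.DataFrame({
    "area": [650, 800, 1200, 1500, 2000, 2400],
    "rent": [9000, 11000, 16000, 20000, 27000, 32000],
})

preview = sample.head()        # first 5 rows by default
first_two = sample.head(2)     # pass a number to change how many rows you get
column_types = sample.dtypes   # one dtype per column

print(preview)
print(column_types)
```

With the real data, the same calls work on df directly.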
Fun Fact 🧠
The .head() method gets its name because it displays the "head", or the first few rows, of a dataset, just like a quick peek at the top of a document.
Visualizing the Data
To understand the relationship between area and rent, let’s plot a scatter plot.
import matplotlib.pyplot as plt
plt.scatter(df['area'], df['rent'])
plt.xlabel('area')
plt.ylabel('rent')
plt.show()
Explanation:
- plt.scatter(): Creates a scatter plot with:
  - x=df['area']: Values on the x-axis (independent variable, area).
  - y=df['rent']: Values on the y-axis (dependent variable, rent).
- plt.xlabel() and plt.ylabel(): Add labels for the x and y axes.
Observation 📊
Here’s what we observe from the plot:
The relationship between area and rent looks roughly linear, so a Linear Regression model is a reasonable first choice!
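One way to put a number on how linear the relationship is would be the Pearson correlation coefficient. The sketch below uses made-up values (an assumption, not the Kaggle data); with the real dataset you would pass df['area'] and df['rent'] instead:

```python
import numpy as np

# Invented example values, standing in for the real columns
area = np.array([650, 800, 1200, 1500, 2000, 2400], dtype=float)
rent = np.array([9000, 11000, 16000, 20000, 27000, 32000], dtype=float)

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal entry is r
r = np.corrcoef(area, rent)[0, 1]
print(f"Pearson r = {r:.3f}")
```

A value of r close to 1 or -1 suggests a straight line describes the data well, while a value near 0 suggests it does not.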
Splitting Data: Features and Target
We’ll now separate our dataset into features (x) and target (y).
# Selecting the feature and target variables
x = df.iloc[:, 0:1] # Feature: Area
y = df.iloc[:, -1] # Target: Rent
y
Explanation:
- iloc: This indexer provides integer-position-based indexing to select specific rows and columns.
- [:, 0:1]: Selects all rows and the first column (0:1 includes column index 0, excludes column index 1).
- [:, -1]: Selects all rows and the last column (-1 refers to the last column).
Pro Tip 📝
The iloc indexer is useful for slicing data:
- x = df.iloc[:, 0:1] → Selects the first column (Area).
- y = df.iloc[:, -1] → Selects the last column (Rent).
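The slice 0:1 matters more than it looks: a slice keeps the result two-dimensional (a DataFrame), which is the shape Scikit-Learn expects for features, while a single position such as -1 or 0 returns a one-dimensional Series. A small sketch with a made-up DataFrame (the `sample` name and values are illustrative assumptions):

```python
import pandas as pd

# Hypothetical data with the feature first and the target last
sample = pd.DataFrame({
    "area": [650, 800, 1200, 1500],
    "rent": [9000, 11000, 16000, 20000],
})

x = sample.iloc[:, 0:1]     # slice of columns -> DataFrame (2-D)
y = sample.iloc[:, -1]      # single column position -> Series (1-D)
x_flat = sample.iloc[:, 0]  # also a Series, not suitable as a feature matrix as-is

print(type(x).__name__, x.shape)
print(type(y).__name__, y.shape)
print(type(x_flat).__name__, x_flat.shape)
```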
Splitting Data: Training and Testing
We need to split our data into training and testing sets. This helps us evaluate how well our model performs on unseen data.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=2)
Explanation:
- train_test_split(): Splits arrays or matrices into random training and testing subsets.
- Arguments:
  - test_size=0.2: Allocates 20% of the data to testing, 80% to training.
  - random_state=2: Ensures reproducibility of the split.
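To see both arguments in action, here is a minimal sketch on ten made-up rows (the arrays are assumptions for illustration). It checks the 80/20 sizes and shows that repeating the call with the same random_state gives the same split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten invented samples: one feature column and a matching target
x = np.arange(10).reshape(-1, 1)
y = np.arange(10) * 2

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=2
)

# Same arguments, same random_state: the split should be identical
x_train2, x_test2, y_train2, y_test2 = train_test_split(
    x, y, test_size=0.2, random_state=2
)

print(len(x_train), len(x_test))
print(np.array_equal(x_test, x_test2))
```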
Fun Fact 🤓
Scikit-Learn, the library we’re using, was originally a Google Summer of Code project. It has since grown into one of the most widely used tools for machine learning.
Implementing Linear Regression
Time to build and train our linear regression model! 🎉
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(x_train, y_train)
Explanation:
- LinearRegression(): Initializes the linear regression model.
- fit(): Trains the model using:
  - x_train: Training feature data (area).
  - y_train: Training target data (rent).
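After fit(), the learned line can be inspected through the model's coef_ (slope) and intercept_ attributes. The sketch below uses made-up points that lie exactly on the line rent = 10 * area + 2000, an assumption chosen so the result is easy to check by eye:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented data lying exactly on a straight line
area = np.array([[500], [1000], [1500], [2000]], dtype=float)
rent = 10.0 * area.ravel() + 2000.0

lr = LinearRegression()
lr.fit(area, rent)

slope = lr.coef_[0]        # change in rent per unit of area
intercept = lr.intercept_  # predicted rent when area is 0

# Should recover roughly slope 10 and intercept 2000
print(slope, intercept)
```

On the real dataset the points will not sit exactly on a line, so the fitted slope and intercept describe the best straight-line approximation in the least-squares sense.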
Making Predictions
Let’s test our model by predicting the rent for a sample area from our test data.
# Predict rent for a specific area in the test set
lr.predict(x_test.loc[[x_test.index[2]], ['area']])
Explanation:
- The predict() method takes data points as input and returns predicted target values (e.g., rent).
- x_test.loc[]: Retrieves specific rows and columns using label-based indexing.
- [x_test.index[2]]: Selects the third row from the test set by its index label.
- ['area']: Ensures only the 'area' column is used as the feature for prediction.
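Because the split shuffles the rows, the index labels in x_test are no longer 0, 1, 2, and so on, which is why the label is looked up through x_test.index[2] first. A position-based alternative is iloc. The sketch below uses a made-up x_test with shuffled labels (an assumption) to show that both forms pick the same row:

```python
import pandas as pd

# Hypothetical test set whose index labels were shuffled by the split
x_test = pd.DataFrame({"area": [1200, 650, 2000, 800]}, index=[17, 4, 42, 8])

by_label = x_test.loc[[x_test.index[2]], ["area"]]  # label lookup, as in the tutorial
by_position = x_test.iloc[[2]]                      # third row by position

print(by_label)
print(by_label.equals(by_position))
```

The double brackets keep the result a one-row DataFrame, which is the two-dimensional input predict() expects.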
Output 🏡
Our model predicts a rent of ₹21,112, which is quite close to the actual rent of ₹21,500! 🎯
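A single prediction is only a spot check. To judge the model across the whole test set, metrics such as mean absolute error and R² can help. The sketch below runs on synthetic data generated with NumPy; the slope, intercept, and noise level are assumptions for illustration, not properties of the Kaggle dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: a noisy straight line
rng = np.random.default_rng(0)
area = rng.uniform(400, 3000, size=200).reshape(-1, 1)
rent = 12.0 * area.ravel() + 3000.0 + rng.normal(0, 1500, size=200)

x_train, x_test, y_train, y_test = train_test_split(
    area, rent, test_size=0.2, random_state=2
)

lr = LinearRegression().fit(x_train, y_train)
y_pred = lr.predict(x_test)  # predictions for every test row

mae = mean_absolute_error(y_test, y_pred)  # average absolute error, in rent units
r2 = r2_score(y_test, y_pred)              # 1.0 would be a perfect fit

print(f"MAE: {mae:.0f}, R^2: {r2:.3f}")
```

With the tutorial's own variables, the same two metric calls can be applied to y_test and lr.predict(x_test).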
Wrapping Up
In this tutorial, we:
- Explored the House Rent Prediction Dataset.
- Visualized the relationship between area and rent.
- Built a Linear Regression Model using Scikit-Learn.
- Made a prediction on the test set and compared it with the actual rent.
What’s Next?
Learn multivariate linear regression!
Happy coding 🌠