Anurag Verma

# Linear Regression in Python: From Data to Model

## What is Linear Regression?

Linear regression is a statistical method for modeling the relationship between a dependent variable (also called the outcome or response variable) and one or more independent variables (also called predictors or explanatory variables). With a single independent variable, known as simple linear regression, the goal is to find the best-fitting line through a set of data points. The line has the form y = mx + b, where y is the dependent variable, x is the independent variable, m is the slope, and b is the y-intercept. "Best-fitting" usually means the line that minimizes the sum of squared vertical distances between the points and the line (ordinary least squares). With more than one independent variable, known as multiple linear regression, the same idea applies with one coefficient per variable.
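For the single-variable case, the least-squares slope and intercept have a closed form. The sketch below is a minimal illustration on made-up numbers; `fit_line` is a hypothetical helper name and is not used elsewhere in this post.

```python
import numpy as np

def fit_line(x, y):
    """Return (m, b) of the least-squares line y = m*x + b."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: co-variation of x and y divided by the variation of x
    m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: the fitted line passes through the point of means
    b = y_mean - m * x_mean
    return m, b

# Made-up data lying exactly on y = 2x + 1
x_demo = [1, 2, 3, 4, 5]
y_demo = [3, 5, 7, 9, 11]
m_demo, b_demo = fit_line(x_demo, y_demo)
print(m_demo, b_demo)  # slope 2.0, intercept 1.0
```

Libraries such as NumPy and scikit-learn solve the same least-squares problem, so in practice you rarely need to write this by hand.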

## Importing Libraries

``````
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
# Jupyter/IPython magic: render plots inline in the notebook
%matplotlib inline
``````

## Loading the train and test datasets into pandas data frames

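``````
train_df = pd.read_csv("/kaggle/input/random-linear-regression/train.csv")
# Drop rows with null values
train_df = train_df.dropna()
# Preview the first five rows
train_df.head()
``````

| | x | y |
|---|---|---|
| 0 | 24.0 | 21.549452 |
| 1 | 50.0 | 47.464463 |
| 2 | 15.0 | 17.218656 |
| 3 | 38.0 | 36.586398 |
| 4 | 87.0 | 87.288984 |
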
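``````
test_df = pd.read_csv("/kaggle/input/random-linear-regression/test.csv")
# Drop rows with null values
test_df = test_df.dropna()
# Preview the first five rows
test_df.head()
``````

| | x | y |
|---|---|---|
| 0 | 77 | 79.775152 |
| 1 | 21 | 23.177279 |
| 2 | 22 | 25.609262 |
| 3 | 20 | 17.857388 |
| 4 | 36 | 41.849864 |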
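
Before dropping rows, it can help to check how many values are actually missing. The sketch below uses a small made-up CSV string in place of the Kaggle files, so the numbers are illustrative only; `demo_df` and `clean_df` are hypothetical names.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a real file; the second data row has no y value
csv_text = "x,y\n1,2.0\n2,\n3,6.5\n"
demo_df = pd.read_csv(io.StringIO(csv_text))

# Count missing values per column before dropping anything
missing_per_column = demo_df.isna().sum()
print(missing_per_column)

# Drop incomplete rows and report how many were removed
clean_df = demo_df.dropna()
rows_dropped = len(demo_df) - len(clean_df)
print("Rows dropped:", rows_dropped)
```

If a large share of rows would be dropped, filling the gaps may be a better choice than `dropna()`, but that depends on the dataset.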

## Selecting the independent and dependent variables

Next, we select the columns of each data frame to use for the x and y axes. Here the column `x` holds the independent variable and the column `y` holds the dependent variable, so we select them like this:

``````
train_x = train_df['x']
train_y = train_df['y']

test_x = test_df['x']
test_y = test_df['y']
``````

## Visualizing the training data

To plot the data, we use Matplotlib, the Python visualization library imported above.

The `plt.scatter()` function plots the data points, and the `plt.plot()` function draws the line of best fit.

To get that line, we call `numpy.polyfit()` with degree 1, which fits a straight line to the points and returns its slope and y-intercept, in that order.

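``````
# Fit a degree-1 polynomial (a straight line) to the training data
coefficients = np.polyfit(train_x, train_y, 1)
m, b = coefficients  # slope, intercept
plt.scatter(train_x, train_y)
# Draw the fitted line in red so it stands out from the points
plt.plot(train_x, m*train_x + b, color='r')
plt.xlabel('train_x')
plt.ylabel('train_y')
plt.show()
``````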
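
As a quick check of what `np.polyfit()` returns, here is a minimal sketch on made-up points that lie exactly on a known line. The variable names are illustrative and not part of the dataset above.

```python
import numpy as np

# Made-up points lying exactly on y = 3x + 2
x_demo = np.array([0.0, 1.0, 2.0, 3.0])
y_demo = 3 * x_demo + 2

# Degree 1 returns two coefficients, highest power first: [slope, intercept]
slope, intercept = np.polyfit(x_demo, y_demo, 1)

# np.polyval evaluates the polynomial, equivalent to slope * x + intercept
fitted = np.polyval([slope, intercept], x_demo)
print(round(slope, 6), round(intercept, 6))
```

The ordering matters: for a higher degree, the first coefficient belongs to the highest power of x and the last one is always the constant term.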

## Visualizing test data

``````
# Fit a separate line to the test data, for visual comparison only
coefficients = np.polyfit(test_x, test_y, 1)
m, b = coefficients  # slope, intercept
plt.scatter(test_x, test_y)
plt.plot(test_x, m*test_x + b, color='r')
plt.xlabel('test_x')
plt.ylabel('test_y')
plt.show()
``````

## Model Creation, training, and testing

To create, train, and test a linear regression model on these data frames, we use the `scikit-learn` library. The first step is to import the model class we want to use.

Here that is the `LinearRegression` class from the `sklearn.linear_model` module:

``````
from sklearn.linear_model import LinearRegression
``````

Next, create an instance of the model:

``````
model = LinearRegression()
``````

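scikit-learn expects the features as a 2-D array with one row per sample and one column per feature, so we first reshape `train_x` and `test_x` from 1-D Series into single-column arrays:

``````
train_x = train_x.values.reshape(-1, 1)
test_x = test_x.values.reshape(-1, 1)
``````

Now we use the `fit()` method to train the model on the training data:

``````
model.fit(train_x, train_y)
``````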
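
To see what `reshape(-1, 1)` does, the sketch below compares array shapes on a tiny made-up data frame. `demo_df` and the other names are hypothetical.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame with the same column names as the dataset
demo_df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [2.0, 4.0, 6.0]})

x_1d = demo_df["x"].values      # shape (3,): one flat array
x_2d = x_1d.reshape(-1, 1)      # shape (3, 1): 3 samples, 1 feature
x_frame = demo_df[["x"]]        # double brackets keep a 2-D DataFrame

print(x_1d.shape, x_2d.shape, x_frame.shape)  # (3,) (3, 1) (3, 1)
```

The `-1` tells NumPy to infer the number of rows from the array length. Selecting with double brackets is an alternative that avoids the reshape, since a one-column data frame is already 2-D.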

Check the coefficients and the intercept of the trained model with the following commands:

``````
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
``````
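
For orientation, `coef_` is an array with one entry per feature, while `intercept_` is a single number when the target is 1-D. The sketch below fits a model on made-up, noise-free data so the recovered values can be compared with the known line; `demo_model` is a hypothetical name.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up, noise-free data on y = 3x + 2
X_demo = np.arange(10, dtype=float).reshape(-1, 1)
y_demo = 3 * X_demo.ravel() + 2

demo_model = LinearRegression().fit(X_demo, y_demo)

# With one feature, the slope is the only entry of coef_
slope = float(demo_model.coef_[0])
intercept = float(demo_model.intercept_)
print(slope, intercept)  # approximately 3.0 and 2.0
```

On real data with noise, the values will only approximate the underlying relationship.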

Now that the model is trained, we can use the `predict()` method to make predictions on the test data:

``````
y_pred = model.predict(test_x)
``````
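
For a single feature, `predict()` amounts to evaluating slope times x plus intercept. A minimal sketch on made-up data, assuming the same 2-D input convention as above (`demo_model` and `new_x` are hypothetical names):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data lying exactly on y = 3x + 2
X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = np.array([5.0, 8.0, 11.0, 14.0])
demo_model = LinearRegression().fit(X_demo, y_demo)

# New inputs must also be 2-D: one row per sample, one column per feature
new_x = np.array([[10.0], [20.0]])
new_pred = demo_model.predict(new_x)

# The same result computed by hand from the fitted parameters
manual = demo_model.coef_[0] * new_x.ravel() + demo_model.intercept_
print(new_pred, manual)
```

Both arrays should agree up to floating-point rounding.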

## Evaluating model performance

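We can evaluate the performance of the model by comparing the predicted values with the actual values. The `sklearn.metrics` module provides several functions for this, including `mean_absolute_error`, `mean_squared_error`, and `r2_score`. For the two error metrics, lower is better; for R², values closer to 1 indicate a better fit.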

``````
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

print("Mean Absolute Error:", mean_absolute_error(test_y, y_pred))
print("Mean Squared Error:", mean_squared_error(test_y, y_pred))
print("R2 Score:", r2_score(test_y, y_pred))
``````
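
To make these metrics concrete, the sketch below computes them by hand with NumPy on a few made-up values, using the standard definitions. The arrays `y_true` and `y_hat` are hypothetical and unrelated to the dataset above.

```python
import numpy as np

# Made-up actual and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 10.0])

errors = y_true - y_hat

# Mean absolute error: average size of the errors
mae = np.mean(np.abs(errors))
# Mean squared error: average squared error, penalizes large misses more
mse = np.mean(errors ** 2)
# R^2: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mae, mse, r2)  # 0.5 0.375 0.925
```

Taking the square root of the mean squared error gives the root mean squared error, which is in the same units as y and is often easier to interpret.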

## Visualizing model performance

We can also visualize the results by plotting the test data points and the predicted line using the same approach as before.

``````
plt.scatter(test_x, test_y)
plt.plot(test_x, y_pred, color='r')
plt.xlabel('x')
plt.ylabel('y')
plt.show()
``````