TheCSPandz


Logistic Regression

What is Logistic Regression?

Logistic Regression is a statistical model commonly used for binary classification tasks, such as spam email detection. Despite its name, it is not used for regression but instead predicts the probability that an instance belongs to a particular class (usually class 1, the positive class).

The model uses the sigmoid (or logistic) function, which outputs a probability between 0 and 1, and is defined as follows:

y = \frac{1}{1 + e^{-\theta^T x}}

Here, θ is the vector of coefficients and x is the feature vector (i.e., the input data for a single instance). The expression θ^T x is the dot product of these two vectors.

The sigmoid function maps any input to an output in the range (0, 1), representing the probability of the positive class. This probability is then compared to a threshold (commonly 0.5). If the probability is greater than or equal to the threshold, the predicted class is 1 (positive class); otherwise, it is 0 (negative class).
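As a minimal sketch of the prediction step described above, the snippet below computes θ^T x for one instance, passes it through the sigmoid, and applies the 0.5 threshold. The values of `theta` and `x` and the helper name `predict_one` are made up for illustration; they are not taken from the dataset used later in this post.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_one(theta, x, threshold=0.5):
    """Return (probability, predicted class) for a single instance."""
    z = sum(t * xi for t, xi in zip(theta, x))  # dot product theta^T x
    p = sigmoid(z)
    return p, int(p >= threshold)

# Hypothetical coefficients and feature vector, for illustration only
theta = [0.5, -1.0, 2.0]
x = [1.0, 2.0, 1.0]

p, label = predict_one(theta, x)
print(f"probability={p:.3f}, class={label}")
```

Here θ^T x = 0.5, so the probability is a little above 0.5 and the instance is assigned to class 1.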

We can express the thresholding mechanism as follows:

f(x) = \begin{cases} 0, & \text{if } \text{sigmoid}(x) < 0.5 \\ 1, & \text{if } \text{sigmoid}(x) \geq 0.5 \end{cases}

To understand this, consider the following graph of the sigmoid function:

Sigmoid Graph

From the graph, we observe that as x becomes more negative, the output y approaches 0, and as x becomes more positive, y approaches 1. The output is exactly 0.5 when x = 0.
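The shape of the curve can also be checked numerically. The short sketch below evaluates the sigmoid at a few arbitrarily chosen points; the sample inputs are assumptions made for illustration.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Arbitrary sample points on both sides of zero
for z in [-6, -2, 0, 2, 6]:
    print(f"sigmoid({z:>2}) = {sigmoid(z):.4f}")
```

The printed values rise from near 0 to near 1 and are symmetric around 0.5, since sigmoid(-z) = 1 - sigmoid(z).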

Cost Function

A cost function is used to measure the error between the predicted output and the actual class labels, penalizing the model for incorrect predictions. This allows the model to adjust its weights (θ) to minimize this error during training.

The cost function for logistic regression on any input instance is given below:

Cost(y_{\text{pred}}, y) = -y \log(y_{\text{pred}}) - (1 - y) \log(1 - y_{\text{pred}})

And the overall cost function is given below:

Cost(\theta) = \frac{1}{m} \sum_{i=1}^{m} Cost(h_{\theta}(x^i), y^i)

This is the standard logistic loss (log loss) used for binary classification, where y_pred is the predicted probability of the positive class (class 1) and y is the actual class label (either 0 or 1). In the overall cost, h_θ(x^i) is the predicted probability for the i-th training instance and m is the number of training instances.

To understand the function Cost(y_pred, y), let's break it down for the two cases, y = 0 and y = 1:

  • Case y = 0:

When the actual label y = 0, the first part of the cost function, -y log(y_pred), becomes 0. The cost is then determined by the second part, -(1 - y) log(1 - y_pred), which simplifies to -log(1 - y_pred). This represents the penalty for incorrectly predicting a high probability for the positive class when the true label is 0: the cost is close to 0 when y_pred is near 0 and grows without bound as y_pred approaches 1.

  • Case y = 1:

When the actual label y = 1, the second part of the cost function, -(1 - y) log(1 - y_pred), becomes 0. The cost is then determined by the first part, -y log(y_pred), which simplifies to -log(y_pred). This represents the penalty for incorrectly predicting a low probability for the positive class when the true label is 1: the cost is close to 0 when y_pred is near 1 and grows without bound as y_pred approaches 0.
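To make the two cases concrete, here is a small sketch that evaluates the per-example cost for both labels at a few predicted probabilities. The function names `example_cost` and `mean_cost` and the probabilities used are assumptions chosen for illustration.

```python
import math

def example_cost(y_pred, y):
    """Log loss for one example; y is 0 or 1, y_pred lies strictly in (0, 1)."""
    return -y * math.log(y_pred) - (1 - y) * math.log(1 - y_pred)

def mean_cost(y_preds, ys):
    """Average of the per-example costs over all examples."""
    return sum(example_cost(p, y) for p, y in zip(y_preds, ys)) / len(ys)

# Illustrative probabilities: confident-wrong, undecided, confident-right
for y in (0, 1):
    for y_pred in (0.1, 0.5, 0.9):
        print(f"y={y}, y_pred={y_pred}: cost={example_cost(y_pred, y):.4f}")
```

A confident wrong prediction (for example y = 0 with y_pred = 0.9) costs far more than a confident correct one, which is the behaviour the two cases describe.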

The goal during training is to minimize this cost function across all training examples, effectively finding the model parameters θ that yield the most accurate predictions.

Gradient Descent

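Gradient descent is the iterative procedure the model uses to find the values of θ that minimize the cost. At each step, θ is moved a small distance in the direction opposite to the gradient of the cost. The size of each step is controlled by the learning rate; 0.01 is a common starting value and is the one used in this post. Because the logistic regression cost is convex, the minimum that gradient descent approaches is the global one. The gradient of the cost with respect to θ is: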

grad = \frac{1}{m} x^T (y_{\text{pred}} - y)
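One way to gain confidence in this gradient expression is to compare it with a finite-difference approximation of the cost. The sketch below does that in plain Python on a tiny made-up dataset; the data, the starting `theta`, and the helper names are all assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(X, y, theta):
    """Mean log loss over all rows of X."""
    total = 0.0
    for row, label in zip(X, y):
        p = sigmoid(sum(t * v for t, v in zip(theta, row)))
        total += -label * math.log(p) - (1 - label) * math.log(1 - p)
    return total / len(y)

def gradient(X, y, theta):
    """Analytic gradient: (1/m) * X^T (y_pred - y)."""
    m = len(y)
    grad = [0.0] * len(theta)
    for row, label in zip(X, y):
        error = sigmoid(sum(t * v for t, v in zip(theta, row))) - label
        for j, v in enumerate(row):
            grad[j] += error * v / m
    return grad

def numeric_gradient(X, y, theta, eps=1e-6):
    """Central finite differences of the cost, one coordinate at a time."""
    grad = []
    for j in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[j] += eps
        down[j] -= eps
        grad.append((cost(X, y, up) - cost(X, y, down)) / (2 * eps))
    return grad

# Toy data (made up): the first column is the constant 1 for the bias term
X = [[1.0, 0.5], [1.0, 2.0], [1.0, -1.0], [1.0, 3.0]]
y = [0, 1, 0, 1]
theta = [0.1, -0.2]

print("analytic:", gradient(X, y, theta))
print("numeric: ", numeric_gradient(X, y, theta))
```

The two printed vectors should agree to several decimal places, which is what one would expect if the formula is the derivative of the cost defined earlier.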

Once the gradient is calculated, the weights θ are updated as follows:

\theta = \theta - (\text{LearningRate} \times \text{grad})
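The effect of the learning rate on this update can be seen on a small example. The sketch below repeats the update for a fixed number of steps with two different learning rates and reports the final cost. The toy data, the step count, and the second learning rate (0.1) are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost_and_gradient(X, y, theta):
    """Return the mean log loss and its gradient (1/m) * X^T (y_pred - y)."""
    m = len(y)
    cost = 0.0
    grad = [0.0] * len(theta)
    for row, label in zip(X, y):
        p = sigmoid(sum(t * v for t, v in zip(theta, row)))
        cost += (-label * math.log(p) - (1 - label) * math.log(1 - p)) / m
        for j, v in enumerate(row):
            grad[j] += (p - label) * v / m
    return cost, grad

def run(X, y, learning_rate, steps):
    """Start from theta = 0 and apply the update rule `steps` times."""
    theta = [0.0] * len(X[0])
    for _ in range(steps):
        _, grad = cost_and_gradient(X, y, theta)
        theta = [t - learning_rate * g for t, g in zip(theta, grad)]
    final_cost, _ = cost_and_gradient(X, y, theta)
    return theta, final_cost

# Toy data (made up, not linearly separable); first column is the bias input
X = [[1.0, 0.5], [1.0, 2.0], [1.0, -1.0], [1.0, 3.0], [1.0, 1.0], [1.0, 1.5]]
y = [0, 1, 0, 1, 1, 0]

for lr in (0.01, 0.1):
    theta, final_cost = run(X, y, lr, steps=200)
    print(f"learning_rate={lr}: theta={theta}, cost={final_cost:.4f}")
```

With θ initialised to zero the starting cost is log 2 (about 0.693). On this data both runs end below that value, and the larger step size gets further in the same number of steps.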

Loss Functions

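Loss functions measure the difference between the predicted output and the actual output, and therefore how accurate the model is. The two used to evaluate the models in this post are Mean Squared Error and Mean Absolute Error. Because the predictions evaluated in this post are hard class labels (0 or 1), every individual error is either 0 or 1, so both metrics come out equal to the fraction of misclassified samples.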

Mean Squared Error (MSE)

Mean Squared Error (MSE) is the function that calculates the average of the squared differences between the predicted output and the actual output. It penalizes larger errors more than smaller ones. The formula for MSE is:

MeanSquaredError(y_{\text{pred}}, y) = \frac{1}{m} \sum_{i=1}^{m} (y_{\text{pred}}^i - y^i)^2

Where:

  • y_pred^i is the predicted output for the i-th sample.
  • y^i is the actual output for the i-th sample.
  • m is the total number of samples.

Mean Absolute Error (MAE)

Mean Absolute Error (MAE) is the function that calculates the average of the absolute differences between the predicted output and the actual output. Unlike MSE, MAE weights each error in proportion to its size, so larger errors are not penalized disproportionately. The formula for MAE is:

MeanAbsoluteError(y_{\text{pred}}, y) = \frac{1}{m} \sum_{i=1}^{m} |y_{\text{pred}}^i - y^i|

Where:

  • y_pred^i is the predicted output for the i-th sample.
  • y^i is the actual output for the i-th sample.
  • m is the total number of samples.
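The difference between the two metrics is easiest to see on numbers. The following sketch implements both formulas in plain Python and applies them to made-up values; the helper names `mse` and `mae` and all data are assumptions for illustration.

```python
def mse(y_pred, y):
    """Mean of the squared differences."""
    return sum((p - t) ** 2 for p, t in zip(y_pred, y)) / len(y)

def mae(y_pred, y):
    """Mean of the absolute differences."""
    return sum(abs(p - t) for p, t in zip(y_pred, y)) / len(y)

# Real-valued example (made up): one prediction is off by 4, the rest are exact
y_true = [3.0, 5.0, 2.0, 7.0]
y_hat = [3.0, 5.0, 2.0, 3.0]
print(mse(y_hat, y_true))  # 16 / 4 = 4.0
print(mae(y_hat, y_true))  # 4 / 4 = 1.0

# Binary labels (made up): each error is exactly 1, so squaring changes nothing
labels = [0, 1, 1, 0, 1]
preds = [0, 1, 0, 0, 0]
print(mse(preds, labels))  # 2 / 5 = 0.4
print(mae(preds, labels))  # 2 / 5 = 0.4
```

The single large error dominates MSE in the first example, while in the binary example the two metrics coincide and equal the share of wrong predictions.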

Program

The dataset used in this post is taken from Kaggle. The code to read and pre-process it is provided below:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("./Titanic Train Data.csv")
# Drop the columns that are not used as features
t = df.drop(columns=["Name", "Sex", "Fare", "Ticket", "SibSp", "Cabin", "PassengerId"])

# Fill missing ages with the mean of the observed ages
# (Series.mean() skips NaN values by default)
t["Age"] = t["Age"].fillna(t["Age"].mean())

# Fill missing embarkation ports with "S"
t["Embarked"] = t["Embarked"].fillna("S")

# Encode the embarkation port as an integer
t["Embarked"] = t["Embarked"].map({"S": 0, "C": 1, "Q": 2})

x = t[["Age", "Pclass", "Parch", "Embarked"]]
y = t["Survived"]

# Hold out a test set (25% of the rows by default); the seed makes the split repeatable
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=42)

Logistic Regression Program from Scratch

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

def sigmoid(z):
    # Logistic function: maps any real value to (0, 1)
    return 1 / (1 + np.exp(-z))

def cost_function(x, y, theta):
    # Mean log loss over all m training examples
    m = len(y)
    z = np.dot(x, theta)
    y_pred = sigmoid(z)
    cost = (1 / m) * np.sum((-y * np.log(y_pred)) - ((1 - y) * np.log(1 - y_pred)))
    return cost

def train(x, y, learning_rate, epochs):
    m, n = x.shape
    theta = np.zeros(n)  # one weight per column, including the bias column
    for i in range(epochs):
        z = np.dot(x, theta)
        y_pred = sigmoid(z)
        gradient = (1 / m) * np.dot(x.T, (y_pred - y))
        theta -= learning_rate * gradient  # gradient descent update
        cost = cost_function(x, y, theta)
        if i % 100 == 0:
            print(f"Cost: {cost:.4f} | Epoch: {i}")
    # theta[0] is the bias term, so display only the feature weights
    print(f"Final weights: {theta[1:]}")
    return theta

def predict(x, theta):
    z = np.dot(x, theta)
    y_pred = sigmoid(z)
    # Class 1 when the predicted probability reaches the 0.5 threshold
    return (y_pred >= 0.5).astype(int)

Driver Code

# Prepend a column of ones so that theta[0] acts as the bias term
x_train = np.hstack([np.ones((x_train.shape[0], 1)), x_train])
x_test = np.hstack([np.ones((x_test.shape[0], 1)), x_test])

theta = train(x_train, y_train, learning_rate=0.01, epochs=1000)
pred = predict(x_test, theta)

accuracy = np.mean(pred == y_test)
print(f"Accuracy: {accuracy:.4f}")

# scikit-learn metrics take the true labels first, then the predictions
MSE = mean_squared_error(y_test, pred)
MAE = mean_absolute_error(y_test, pred)
print(f"MSE: {MSE}")
print(f"MAE: {MAE}")

Output

Cost: 0.7098 | Epoch: 0
Cost: 0.7498 | Epoch: 100
Cost: 0.7454 | Epoch: 200
Cost: 0.7394 | Epoch: 300
Cost: 0.7332 | Epoch: 400
Cost: 0.7272 | Epoch: 500
Cost: 0.7216 | Epoch: 600
Cost: 0.7164 | Epoch: 700
Cost: 0.7116 | Epoch: 800
Cost: 0.7071 | Epoch: 900

Final weights: [ 0.02662689 -0.44641909  0.22831224  0.27235578]

Accuracy: 0.6457

MSE: 0.3542600896860987
MAE: 0.3542600896860987

Note:

In the following lines:

x_train = np.hstack([np.ones((x_train.shape[0], 1)), x_train])  
x_test = np.hstack([np.ones((x_test.shape[0], 1)), x_test])       

A column of ones is added to the input feature matrices, introducing a bias term in the logistic regression model. The bias term is a parameter that allows the model to make predictions by shifting the output independently of the input features. This helps the model fit data that doesn't necessarily pass through the origin. The bias term is learned during training, allowing the model to adjust the decision boundary and make more accurate predictions.

The bias term is represented by:

\theta_0
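A small sketch can show why the column of ones yields a bias term: after a constant 1 is prepended to every row, the first coefficient is added to the score regardless of the feature values. The helper names, the feature rows, and the coefficients below are made up for illustration.

```python
def add_bias_column(rows):
    """Prepend a constant 1 to every feature row."""
    return [[1.0] + list(row) for row in rows]

def linear_score(theta, row):
    """Dot product theta^T x for one (augmented) row."""
    return sum(t * v for t, v in zip(theta, row))

# Hypothetical feature rows and coefficients; theta[0] plays the role of the bias
features = [[2.0, 3.0], [0.0, 0.0]]
theta = [0.5, 1.0, -1.0]

augmented = add_bias_column(features)
for row in augmented:
    print(row, "->", linear_score(theta, row))
```

For the all-zero feature row the score equals theta[0], which is the shift away from the origin that the bias term provides.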

Note that scikit-learn's LogisticRegression model fits this bias term (the intercept) internally, so when using the pre-built model you don't need to add the column of ones manually as is done above; pass it the original feature matrices instead.

Logistic Regression Program using Sklearn

Program

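from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error

# x_train and x_test are the feature matrices from train_test_split,
# without the extra column of ones: LogisticRegression fits the intercept itself
log_model = LogisticRegression()
log_model.fit(x_train, y_train)
model_predictions = log_model.predict(x_test)

# scikit-learn metrics take the true labels first, then the predictions
MSE = mean_squared_error(y_test, model_predictions)
MAE = mean_absolute_error(y_test, model_predictions)

print(f"Final Weights: {log_model.coef_}")
print(f"MSE: {MSE}")
print(f"MAE: {MAE}")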

Output

Final Weights: [[-0.02804986 -1.02796635  0.20322083  0.40791547]]
MSE: 0.26905829596412556
MAE: 0.26905829596412556