Harsimranjit Singh

Support Vector Machines: From Hard Margin to Soft Margin

Support Vector Machines (SVMs) are powerful tools in machine learning for classifying data and predicting values. They are popular in fields like bioinformatics and financial forecasting because they handle complex data problems well. This article explains SVMs in a simple way, covering the maximal margin (hard margin) classifier, support vectors, slack variables, hinge loss, and the soft margin formulation.

What is an SVM?

At its core, an SVM performs classification by finding the hyperplane that best divides a dataset into classes. The unique aspect of SVMs lies in their ability to find the optimal hyperplane that maximizes the margin, the distance between the hyperplane and the nearest data points from each class. A larger margin helps the SVM generalize better to new data because the decision boundary is clearer. The SVM does this with the help of support vectors.

Support Vectors

Support vectors are the subset of training points that are directly involved in constructing the optimal separating hyperplane. These points are crucial because they lie on the edge of the margin and determine the position and orientation of the hyperplane.


Mathematically, if you have a dataset with class labels $y_i \in \{-1, +1\}$ and feature vectors $x_i$, the support vectors satisfy the condition:

y_i (w \cdot x_i + b) = 1

Where does this equation come from?

The hyperplane in d-dimensional space is defined by the equation:

w \cdot x + b = 0
where:

  • w is the normal vector (normal to the hyperplane).
  • b is the bias term (the offset from the origin).

For a given data point $x_i$ with class label $y_i$:

  • if $y_i = 1$, the data point belongs to the positive class.
  • if $y_i = -1$, the data point belongs to the negative class.

We want the hyperplane to correctly classify these data points, so we need to ensure that:

  • Points from the positive class are on one side of the hyperplane. For points on the positive side: $w \cdot x_i + b \geq 1$
  • Points from the negative class are on the other side. For points on the negative side: $w \cdot x_i + b \leq -1$

To combine these constraints into a single form, we multiply by the class label $y_i$; the constraint can then be written as:

y_i (w \cdot x_i + b) \geq 1
Here's why:

  • if $y_i = 1$, the constraint becomes $w \cdot x_i + b \geq 1$, which ensures correct classification of positive-class points.
  • if $y_i = -1$, the constraint becomes $w \cdot x_i + b \leq -1$, which ensures correct classification of negative-class points (a quick numeric check follows below).
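As a quick numeric check, here is a minimal sketch, using a made-up hyperplane and points chosen purely for illustration, showing how the single inequality covers both classes and flags a violation:

import numpy as np

# Hypothetical hyperplane parameters, chosen only for illustration
w = np.array([1.0, 1.0])
b = -3.0

# Made-up points and labels; the last point deliberately violates the constraint
points = np.array([[1.0, 1.0], [3.0, 3.0], [0.5, 1.0], [2.0, 1.5]])
labels = np.array([-1, 1, -1, 1])

for x_i, y_i in zip(points, labels):
    value = y_i * (np.dot(w, x_i) + b)
    status = "satisfies" if value >= 1 else "violates"
    print(f"x={x_i}, y={y_i}: y*(w.x + b) = {value:.1f} -> {status} the constraint")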

Margin Calculation

In SVMs, the margin is the distance between the hyperplane and the closest data points from each class (the support vectors).

To calculate the margin, we use the following formula:
\text{Margin} = \frac{2}{||w||}

Where does this formula come from?

The perpendicular distance d from a point $x_i$ to the hyperplane is:
d = \frac{|w \cdot x_i + b|}{||w||}
Support vectors lie on the boundaries of the margin, where
|w \cdot x_i + b| = 1
Therefore, their distance from the hyperplane is exactly:
d = \frac{1}{||w||}

Now we need the distance between the two margin hyperplanes
w \cdot x + b = 1
w \cdot x + b = -1
Each lies a distance of $\frac{1}{||w||}$ from the separating hyperplane, on opposite sides, so the total width is:
\text{Margin} = \frac{2}{||w||}
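As a small sketch, here is this formula evaluated for an arbitrary weight vector (chosen purely for illustration, not fitted from data):

import numpy as np

# Arbitrary example weight vector (for illustration only)
w = np.array([0.67, 0.67])

# Width of the band between the hyperplanes w.x + b = 1 and w.x + b = -1
margin = 2 / np.linalg.norm(w)
print(f"||w|| = {np.linalg.norm(w):.3f}, margin = 2/||w|| = {margin:.3f}")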

Understanding Hard Margin SVM

The term "Hard Margin" comes from the fact that the algorithm requires every data point to be classified correctly with a margin of at least 1. In other words, no misclassifications are allowed. These strict requirements are why it's called a "hard" margin.

Formulating the Hard Margin SVM

1. Objective function

The goal of the hard-margin SVM is to maximize the margin between the two classes. As we discussed previously:

\text{Margin} = \frac{2}{||w||}

To maximize the margin, we can equivalently minimize its reciprocal, which leads to the objective:
\text{Minimize } \frac{1}{2} ||w||^2

Why square the norm? Because it gives a smooth, differentiable objective, which makes it easier to compute gradients and apply gradient-based optimization methods.
Minimizing the squared norm is equivalent to minimizing the norm itself: minimizing $||w||^2$ always leads to the same optimal $w$ as minimizing $||w||$.

2. Constraints:

The constraints ensure that each point is correctly classified and lies at a distance of at least $\frac{1}{||w||}$ from the hyperplane:
y_i (w \cdot x_i + b) \geq 1

3. Hard Margin SVM Optimization Problem:

Putting it all together, the hard-margin SVM optimization problem is:
\text{Minimize } \frac{1}{2} ||w||^2 \text{ subject to } y_i (w \cdot x_i + b) \geq 1, \ \forall i

Solving this optimization problem gives the maximum-margin hyperplane.
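In practice this quadratic program is rarely solved by hand. One common shortcut, sketched below, is to approximate the hard margin with scikit-learn's soft-margin SVC by setting C to a very large value; the toy points here are made up and must be linearly separable for this approximation to make sense:

import numpy as np
from sklearn.svm import SVC

# Tiny, linearly separable toy dataset (made up for illustration)
X = np.array([[1.0, 1.0], [1.5, 1.0], [3.0, 3.0], [3.5, 3.0]])
y = np.array([-1, -1, 1, 1])

# A very large C leaves almost no room for margin violations,
# which approximates the hard-margin problem
hard_svm = SVC(kernel='linear', C=1e10)
hard_svm.fit(X, y)

w, b = hard_svm.coef_[0], hard_svm.intercept_[0]
print("w =", w, ", b =", b)
print("margin width = 2/||w|| =", 2 / np.linalg.norm(w))
print("support vectors:\n", hard_svm.support_vectors_)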

Problem with Hard Margin

While hard-margin SVMs are effective for linearly separable data, they come with certain limitations.

They fail in the presence of outliers and misclassified data.

When two of the points are outliers, the hard-margin SVM fails to find a decision boundary at all: it tries to classify every point correctly but cannot satisfy the constraints for those two points.

To tackle this, the soft-margin SVM is used.


Soft Margin SVM

While the hard-margin SVM works well with linearly separable data, it struggles with datasets containing outliers or overlapping classes. To address these limitations, the soft-margin SVM introduces "slack variables".


Slack Variables

A slack variable $\xi_i$ is introduced for each data point to measure the degree to which that point is allowed to violate the margin constraint. Slack variables allow for some misclassification and margin violations. For each data point i, the slack variable $\xi_i$ is non-negative and represents the extent of the constraint violation.

  • if $\xi_i = 0$, the data point is correctly classified and lies on or outside the margin.
  • if $0 < \xi_i < 1$, the data point is correctly classified but lies within the margin.
  • if $\xi_i \geq 1$, the data point is misclassified.

Soft-Margin SVM: Objective Function and Constraints

Objective Function
The soft-margin SVM modifies the objective function to incorporate slack variables. The goal is now to balance between maximizing the margin and minimizing the classification error. The new objective function is:
\text{Minimize } \frac{1}{2} ||w||^2 + C \sum_{i=1}^n \xi_i
where:

  • $\frac{1}{2} ||w||^2$ still represents the margin-maximization term.
  • C is the regularization parameter that controls the trade-off between margin size and classification error.
  • $\xi_i$ are the slack variables representing the extent of the violation for each data point.

Constraints
The constraints are modified to account for slack variables:
y_i (w^T x_i + b) \geq 1 - \xi_i
\xi_i \geq 0
where:

  • The term $1 - \xi_i$ allows some flexibility by permitting data points to lie within or beyond the margin.
  • $\xi_i \geq 0$ ensures that slack variables are non-negative, representing non-negative deviations from the ideal margin (a short sketch after this list shows how to recover the $\xi_i$ from a fitted model).
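At the optimum, each slack equals the hinge of its margin, $\xi_i = \max(0, 1 - y_i(w \cdot x_i + b))$. The sketch below (reusing the five-point toy dataset from the worked example later in this article; the exact numbers depend on the solver) recovers the slacks from a fitted scikit-learn model and evaluates the soft-margin objective:

import numpy as np
from sklearn.svm import SVC

# Five-point toy dataset, the same one used in the worked example below
X = np.array([[1, 2], [2, 2], [2.5, 2.5], [3, 1], [3, 3]])
y = np.array([1, 1, -1, -1, 1])

C = 1.0
model = SVC(kernel='linear', C=C).fit(X, y)
w, b = model.coef_[0], model.intercept_[0]

# At the optimum each slack equals max(0, 1 - y_i * (w.x_i + b))
xi = np.maximum(0, 1 - y * (X @ w + b))

objective = 0.5 * np.dot(w, w) + C * xi.sum()
print("slack variables:", np.round(xi, 2))
print("objective value 0.5*||w||^2 + C*sum(xi) =", round(objective, 2))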

What is $\xi_i$?

Hinge Loss
The hinge loss function is a crucial concept for understanding the role of $\xi_i$ in SVMs. It measures the cost associated with misclassifications and margin violations. The hinge loss for a data point $(x_i, y_i)$ is defined as:
L(y_i, f(x_i)) = \max(0, 1 - y_i f(x_i))
\text{where } f(x_i) = w \cdot x_i + b

Intuition behind the Hinge Loss Function

1. Decision Function:

In SVMs, the decision function is $f(x_i) = w \cdot x_i + b$, where w is the weight vector, x is the input feature vector, and b is the bias term.

2. Margin and Classification:

  • For a correctly classified point, we want $y_i f(x_i) \geq 1$. This ensures that the point is on the correct side of the margin.
  • For a point that violates the margin or is misclassified, $y_i f(x_i) < 1$.

3. Hinge Loss:

The hinge loss penalizes points that do not satisfy the margin requirement $y_i f(x_i) \geq 1$. The loss is defined as:

L(y_i, f(x_i)) = \max(0, 1 - y_i f(x_i))

  • Correct Classification with Margin:
    when $y_i f(x_i) \geq 1$ :
    L(y_i, f(x_i)) = \max(0, 1 - y_i f(x_i)) = 0
    This means there is no penalty because the point is correctly classified and lies on or outside the margin.

  • Margin Violation:
    when $0 \leq y_i f(x_i) < 1$ :
    L(y_i, f(x_i)) = \max(0, 1 - y_i f(x_i)) = 1 - y_i f(x_i)
    This means the point is correctly classified but lies within the margin, resulting in a penalty proportional to its distance from the margin.

  • Misclassification:
    when $y_i f(x_i) < 0$ :
    L(y_i, f(x_i)) = \max(0, 1 - y_i f(x_i)) = 1 - y_i f(x_i)
    This means the point is misclassified, resulting in a penalty that increases as the point moves further away from the correct side of the decision boundary. (A minimal code sketch of all three cases follows below.)
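A minimal vectorized sketch of these three cases (the margin values $y_i f(x_i)$ below are made up so that each branch is hit):

import numpy as np

def hinge_loss(margins):
    # margins = y_i * f(x_i); the loss is zero once the margin reaches 1
    return np.maximum(0, 1 - margins)

# Made-up margins: outside the margin, inside the margin, misclassified
margins = np.array([1.5, 0.5, -0.5])
print(hinge_loss(margins))  # -> [0.  0.5 1.5]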

Explaining Hinge Loss with an Example

Let's use a small dataset consisting of five points:

  • (1,2) with label 1

  • (2,2) with label 1

  • (2.5,2.5) with label -1

  • (3,1) with label -1

  • (3,3) with label 1

Let's fit a linear SVM model with a soft margin (C=1).

From the trained model:

  • w = [0.67, 0.67]
  • b = -2

Hinge Loss Calculation

For each point, we calculate the hinge loss:

  1. Point (1,2):

    • y = 1
    • f(x) = 0.67×1 + 0.67×2 - 2 = 1.00
    • Hinge loss: max(0, 1 - y·f(x)) = max(0, 1 - 1×1.00) = 0.00
    • Correctly classified and outside the margin
  2. Point (2,2):

    • y = 1
    • f(x) = 0.67×2 + 0.67×2 - 2 = 0.50
    • Hinge loss: max(0, 1 - y·f(x)) = max(0, 1 - 1×0.50) = 0.50
    • Correctly classified but within the margin
  3. Point (2.5,2.5):

    • y = -1
    • f(x) = 0.67×2.5 + 0.67×2.5 - 2 = 1.33
    • Hinge loss: max(0, 1 - y·f(x)) = max(0, 1 + 1.33) = 2.33
    • Misclassified
  4. Point (3,1):

    • y = -1
    • f(x) = 0.67×3 + 0.67×1 - 2 = -1.00
    • Hinge loss: max(0, 1 - y·f(x)) = max(0, 1 - (-1)×(-1.00)) = 0.00
    • Correctly classified and outside the margin
  5. Point (3,3):

    • y = 1
    • f(x) = 0.67×3 + 0.67×3 - 2 = 2.00
    • Hinge loss: max(0, 1 - y·f(x)) = max(0, 1 - 1×2.00) = 0.00
    • Correctly classified and outside the margin
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC

# Five-point toy dataset from the example above
X = np.array([[1, 2], [2, 2], [2.5, 2.5], [3, 1], [3, 3]])
y = np.array([1, 1, -1, -1, 1])

# Fit a linear soft-margin SVM with C = 1
model = SVC(kernel='linear', C=1)
model.fit(X, y)
w = model.coef_[0]
b = model.intercept_[0]

# Decision boundary: w[0]*x + w[1]*y + b = 0  =>  y = (-w[0]*x - b) / w[1]
x_plot = np.linspace(0, 4, 100)
y_plot = -w[0] / w[1] * x_plot - b / w[1]

# Margin boundaries satisfy w.x + b = +/-1, i.e. a vertical offset of 1/w[1]
y_margin1 = y_plot + 1 / w[1]
y_margin2 = y_plot - 1 / w[1]


# Plot the points, the decision boundary, and the margin lines
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='bwr', s=50)
plt.plot(x_plot, y_plot, 'k-', label='Decision Boundary')
plt.plot(x_plot, y_margin1, 'k--', label='Margin')
plt.plot(x_plot, y_margin2, 'k--')


for i, txt in enumerate(y):
    plt.annotate(f'({X[i][0]},{X[i][1]})', (X[i][0], X[i][1]))

plt.annotate('Misclassified Point', xy=(2.5, 2.5), xytext=(3, 1.5),
             arrowprops=dict(facecolor='red', shrink=0.05),
             fontsize=10, color='red')
plt.annotate('Correctly Classified\nBut Within Margin', xy=(2, 2), xytext=(0.5, 2.5),
             arrowprops=dict(facecolor='blue', shrink=0.05),
             fontsize=10, color='blue')

plt.xlim(0, 4)
plt.ylim(0, 4)
plt.title('Soft Margin SVM')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()

def hinge_loss_point(x, y, w, b):
    # Hinge loss for a single point: max(0, 1 - y * (w.x + b))
    f_x = np.dot(w, x) + b
    return max(0, 1 - y * f_x)

# Report f(x) and the hinge loss for every training point
for i in range(len(X)):
    x_i = X[i]
    y_i = y[i]
    loss = hinge_loss_point(x_i, y_i, w, b)
    print(f'Point {i+1}: x={x_i}, y={y_i}, f(x)={np.dot(w, x_i) + b:.2f}, Hinge Loss={loss:.2f}')


Short Summary

  • Correctly classified and outside the margin: no penalty, e.g. points (1,2), (3,1), and (3,3).
  • Correctly classified but within the margin: a penalty proportional to the distance from the margin, e.g. point (2,2).
  • Misclassified: a high penalty that increases with the distance from the correct side of the boundary, e.g. point (2.5,2.5).

Constraints for the Soft Margin

y_i (w^T x_i + b) \geq 1 - \xi_i
\xi_i \geq 0

Why $1 - \xi_i$?

Let's take simple examples covering the different cases:

  1. Point A: Correctly classified and outside the margin

    • Slack variable: $\xi_A = 0$
    • Constraint: $y_A (w^T x_A + b) \geq 1 - \xi_A$
    • Satisfied: the constraint reduces to the original one because $\xi_A = 0$, so there is no margin violation.
  2. Point B: Correctly classified but within the margin ( $0 < \xi_B < 1$ )

    • Slack variable: $\xi_B = 0.5$
    • Constraint: $y_B (w^T x_B + b) \geq 1 - \xi_B$
    • Satisfied: the point is within the margin but still classified correctly. The slack variable $\xi_B$ of 0.5 allows this.
    • Interpretation
      • Without the slack variable: the constraint would be $y_B (w^T x_B + b) \geq 1$, meaning the point $x_B$ would need to be classified correctly and lie on or outside the margin.
      • With the slack variable: $\xi_B = 0.5$ relaxes the constraint. It lowers the requirement from 1 to 0.5, so the point can be within the margin and still satisfy the relaxed constraint.
  3. Point C: Misclassified

    • Slack variable: $\xi_C = 1$
    • Constraint: $y_C (w^T x_C + b) \geq 1 - \xi_C = 0$
    • Satisfied: the point is misclassified, but the slack variable $\xi_C$ of 1 relaxes the constraint enough to allow this. (A tiny numeric sketch of these three cases follows below.)
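A tiny numeric sketch of points A, B, and C (the margin values $y_i(w^T x_i + b)$ here are made up to match the three cases):

# Made-up margins y_i * (w^T x_i + b) for points A, B, and C
margins = {"A": 1.3, "B": 0.5, "C": -0.2}

for name, m in margins.items():
    xi = max(0, 1 - m)          # smallest slack that makes the constraint feasible
    satisfied = m >= 1 - xi     # relaxed constraint y*(w^T x + b) >= 1 - xi
    print(f"Point {name}: margin = {m:+.1f}, xi = {xi:.1f}, relaxed constraint satisfied: {satisfied}")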

Real Objective Function

In the soft-margin SVM, we aim to balance two objectives:

  1. Maximizing the margin (which is equivalent to minimizing $\frac{1}{2}||w||^2$).
  2. Minimizing the classification error, which is related to the slack variables.

Putting these together, we need to solve:

\text{Minimize } \frac{1}{2} ||w||^2 + C \sum_{i=1}^n \xi_i
subject to
y_i (w^T x_i + b) \geq 1 - \xi_i
\xi_i \geq 0
The Soft Margin SVM problem is a type of quadratic programming problem. It involves a quadratic objective function and linear constraints. To solve the problem efficiently, it is often useful to convert it to its dual form using Lagrange multipliers.
The dual form of this problem is:
\max_{\alpha} \left( \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j (x_i^T x_j) \right)
\text{subject to } 0 \leq \alpha_i \leq C, \quad \sum_{i=1}^n \alpha_i y_i = 0

Derivation

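A brief sketch of the standard derivation via Lagrange multipliers (an outline, not a full proof): introduce multipliers $\alpha_i \geq 0$ for the classification constraints and $\mu_i \geq 0$ for $\xi_i \geq 0$, and form the Lagrangian:

L(w, b, \xi, \alpha, \mu) = \frac{1}{2} ||w||^2 + C \sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i \left[ y_i (w^T x_i + b) - 1 + \xi_i \right] - \sum_{i=1}^n \mu_i \xi_i

Setting the derivatives with respect to w, b, and $\xi_i$ to zero gives:

w = \sum_{i=1}^n \alpha_i y_i x_i, \quad \sum_{i=1}^n \alpha_i y_i = 0, \quad C - \alpha_i - \mu_i = 0

Substituting these back into the Lagrangian eliminates w, b, and $\xi$ and leaves exactly the dual problem stated above, with the box constraints $0 \leq \alpha_i \leq C$ coming from $\mu_i \geq 0$.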

Regularization Parameter (C) in SVM

The regularization parameter C in an SVM plays a crucial role in the bias-variance tradeoff, as the short example after the list below illustrates:

  • High C value:
    • Effect: A high C value aims to minimize the training error by allowing fewer misclassifications on the training set. This reduces bias but increases variance, making the model more prone to overfitting.
    • Result: The decision boundary becomes very tight around the support vectors, potentially fitting noise in the training data.
  • Low C value:
    • Effect: A low C value allows more misclassifications on the training set, which increases bias but reduces variance. The model is less flexible and smoother.
    • Result: The decision boundary is smoother and less sensitive to the training data, which can prevent overfitting but might lead to underfitting.
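A short sketch that makes this trade-off visible (reusing the five-point toy dataset from the example above; the exact numbers depend on the solver):

import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 2], [2.5, 2.5], [3, 1], [3, 3]])
y = np.array([1, 1, -1, -1, 1])

# Compare margin width and training accuracy for increasing C
for C in [0.01, 1, 100]:
    model = SVC(kernel='linear', C=C).fit(X, y)
    w = model.coef_[0]
    margin = 2 / np.linalg.norm(w)
    acc = model.score(X, y)
    print(f"C={C:>6}: margin width = {margin:.2f}, training accuracy = {acc:.2f}, "
          f"#support vectors = {len(model.support_vectors_)}")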

Conclusion

Support Vector Machines offer a robust approach for classification tasks, excelling in scenarios where a clear decision boundary is essential. By understanding key concepts such as support vectors, the margin, slack variables, and hinge loss, practitioners can effectively leverage SVMs for a wide range of machine learning applications.
