Pratik Patre for CoffeeBeans Consulting

Extending Linear Programming to Statistical Learning

What is linear programming?

In simple words, linear programming is a mathematical technique for finding the best possible outcome of a problem that is subject to a set of constraints. Believe it or not, we have been using linear programming in everyday life (technically not programming in our minds, but the concept). Simple examples would be packing your bag for a trip, managing time in an exam, or buying a variety of chocolates within a set budget. At its core it is an optimisation problem, the same kind of problem that underlies frequentist techniques like Maximum Likelihood Estimation (MLE). Mathematically, we usually solve it by defining variables and constraints and then finding the maximum or the minimum of the objective we have set.

How do we formulate a Linear Programming problem?

Let us look at the steps of defining a Linear Programming problem generically:

  1. We first decide the unknowns, which act as the decision variables
  2. We define the objective function
  3. We identify the constraints, i.e. the system of equalities or inequalities representing the restrictions on the decision variables
  4. We apply the non-negativity restriction, i.e. the decision variables must take values greater than or equal to zero

For a problem to be a linear programming problem, the objective function and all of the constraints have to be linear functions of the decision variables. If these conditions are satisfied, it is called a Linear Programming Problem.
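To make the four steps concrete, here is a minimal sketch of the chocolate-budget example from the introduction, written with SciPy's linprog solver. The prices, values and budget are made-up numbers, and linprog treats the quantities as continuous rather than as whole chocolates.

```python
from scipy.optimize import linprog

values = [6, 5, 4]   # how much we enjoy each type of chocolate (made-up)
prices = [3, 2, 1]   # cost of each type of chocolate (made-up)
budget = 10          # total budget

# Step 1: decision variables x1, x2, x3 = quantity of each chocolate
# Step 2: objective -> maximise sum(values * x); linprog minimises,
#         so we negate the coefficients
# Step 3: constraint -> sum(prices * x) <= budget
# Step 4: non-negativity -> linprog's default bounds are (0, None)
result = linprog(c=[-v for v in values],
                 A_ub=[prices], b_ub=[budget],
                 method="highs")

print(result.x)     # optimal quantity of each chocolate
print(-result.fun)  # maximum total value within the budget
```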

Linear Programming in action

Ordinary Least Squares (OLS) Basics

Let us see how linear programming can be combined with gradient descent and used in linear regression in machine learning to solve an optimisation problem. Let's start simple: in linear regression we predict a quantitative response y on the basis of a single predictor variable x. It assumes a linear relationship between x and y, i.e. y ~ w0 + w1*x. We use the training data to produce the model coefficients w0 and w1, and we can then predict yhat for X = x, i.e. yhat = w0 + w1*x. Using the training data we find a line that most closely represents the data points. There are various ways to measure what it means to closely represent. We may, for instance, minimise the average distance (deviation) of the data points from the line, or minimise the sum of distances, or the sum of squares of distances, or minimise the maximum distance of a data point from the line. Here the distance can be the Euclidean distance, the vertical distance, the Manhattan distance (vertical + horizontal), or something else.

We choose to work with the vertical distance of a point from the line. In two-dimensional space, a general equation of a line with finite slope has the form y = w0 + w1*x, where w0 and w1 are parameters (intercept and slope respectively). For a point (p, q), the vertical distance of the point from the line y = w0 + w1*x can be written as |q − w1*p − w0|, which is nothing but the residual (e1) for that point. If we square the residuals and sum them over the n points in our data, we get the residual sum of squares, i.e. RSS = e1² + e2² + … + en².
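As a quick illustration of the RSS calculation, here is a small NumPy sketch; the sample points and the candidate line are made-up values used only to show the arithmetic.

```python
import numpy as np

# Residual sum of squares (RSS) for a candidate line y = w0 + w1*x
def rss(w0, w1, x, y):
    residuals = y - (w0 + w1 * x)   # vertical distance of each point from the line
    return np.sum(residuals ** 2)

# Made-up sample points, just to illustrate the calculation
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.1, 3.9, 6.2])
print(rss(w0=0.0, w1=2.0, x=x, y=y))
```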

Let's Start with an Example

Defining decision variable and objective function
Now consider that we want to predict a house's price from a single variable, the area of the house. This is our decision variable, which we denote as x. For a sample point, g(w) = w0 + w1*x is the expected price of a house with area x. Now, y = g(w) = w0 + w1*x will be marked as the objective function, as we have to optimise this line to best fit the data.

I will not go into the proof, but if we take the partial derivatives of the cost function with respect to w0 and w1 and represent them in vector form, we get its gradient. The mathematical expression is given below.

∇RSS(w0, w1) = [ ∂RSS/∂w0 , ∂RSS/∂w1 ] = [ 2 * sum(yhat − y) , 2 * sum(x * (yhat − y)) ]

The cost function is the residual sum of squares, and the gradient of the RSS with respect to w0 and w1, written in vector form, is ∇RSS(w0, w1) as shown above.
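A small sketch of how those two partial derivatives could be computed with NumPy is shown below; the function name and arguments are just illustrative.

```python
import numpy as np

# Gradient of the RSS cost with respect to (w0, w1)
def rss_gradient(w0, w1, x, y):
    errors = (w0 + w1 * x) - y      # yhat - y for every data point
    d_w0 = 2 * np.sum(errors)       # partial derivative w.r.t. w0
    d_w1 = 2 * np.sum(errors * x)   # partial derivative w.r.t. w1
    return np.array([d_w0, d_w1])
```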

Defining Constraints
So now we can define our constraints:

We will have to update w0 and w1 based on the following equations:
w0 = w0 - learning rate * partial derivative of RSS with respect to w0
For simplicity we can write it as w0 = w0 - 2 * learning rate * B …………..(2)
where B = (sum of the errors, yhat - y, after prediction over all data points)

w1 = w1 - learning rate * partial derivative of RSS with respect to w1
For simplicity we can write it as w1 = w1 - 2 * learning rate * C …………..(3)
where C = (sum of errors*x after prediction over all data points)
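Put together, one such update following equations (2) and (3) might look like the sketch below; x and y are assumed to be NumPy arrays of the same length, and the function name is illustrative.

```python
# One gradient-descent update of (w0, w1) following equations (2) and (3);
# x and y are NumPy arrays of the same length
def gradient_descent_step(w0, w1, x, y, learning_rate):
    errors = (w0 + w1 * x) - y     # yhat - y for every data point
    B = errors.sum()               # sum of errors, used in equation (2)
    C = (errors * x).sum()         # sum of errors * x, used in equation (3)
    w0 = w0 - 2 * learning_rate * B
    w1 = w1 - 2 * learning_rate * C
    return w0, w1, B, C
```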

We repeat the above two updates until convergence.

Technically, convergence means the magnitude of the gradient vector is 0, but in practice we stop once the magnitude of the gradient vector falls below a tolerance value T.

The two components of the gradient are 2*B and 2*C, so its magnitude is 2 * square root(B² + C²). Absorbing the constant factor of 2 into the tolerance, our stopping condition becomes

Square root(B² + C²) < T ………….(1a)
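This stopping condition is a one-liner in code; the helper below is only a sketch of that check.

```python
import math

# Stopping rule from equation (1a): stop once sqrt(B^2 + C^2)
# falls below the tolerance T
def has_converged(B, C, tolerance):
    return math.sqrt(B**2 + C**2) < tolerance
```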

Defining Non-negativity restriction
The non-negativity restriction: the area of the house must be greater than or equal to zero, i.e. x >= 0.

Now that we have formulated our linear program with the help of gradient descent, let us solve for the optimum line.

Operationalising Linear Programming

Let's consider the following data points

Area of the house (x): 0, 1, 2, 3, 4
Price of the house (y): 1, 3, 7, 13, 21

We start by assuming intercept = w0 = 0 and slope = w1 = 0, and by setting the step size (learning rate) = 0.025 and the tolerance = 0.01.

Step 1: We compute the predictions yhat = w0 + w1*x.
We get the following predictions for the five data points: [0, 0, 0, 0, 0]

Step 2: We compute the errors after prediction as yhat - y.
We get the following errors: [-1, -3, -7, -13, -21]

Step 3: We update the intercept by summing the errors and plugging the sum into equation (2):
0 - 2*0.025*sum([-1, -3, -7, -13, -21]) = 0 - 2*0.025*(-45) = 2.25

Step 4: We update the slope by summing the products of the errors and the respective areas of the houses, as per equation (3):
0 - 2*0.025*sum([0*(-1), 1*(-3), 2*(-7), 3*(-13), 4*(-21)]) = 0 - 2*0.025*(-140) = 7

Step 5: We calculate the magnitude, which is the LHS of equation (1a):
sqrt((-45)² + (-140)²) ≈ 147.05

Step 6: We check equation (1a) to see whether we have converged. Since 147.05 is greater than 0.01 (the tolerance value we set), we repeat the steps, this time with the updated w0 and w1.

Since the purpose of this article is to see how linear programming ideas can be used in OLS, we will stop here. But feel free to code this up and find the optimum values of w0 and w1 yourself; one possible sketch is given below.
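Here is one possible sketch of the full loop on the five data points above; the iteration cap of 10,000 is an arbitrary safety limit.

```python
import math
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])    # area of the house
y = np.array([1.0, 3.0, 7.0, 13.0, 21.0])  # price of the house

w0, w1 = 0.0, 0.0
learning_rate, tolerance = 0.025, 0.01

for iteration in range(10_000):
    errors = (w0 + w1 * x) - y                 # Steps 1 and 2: predict, then yhat - y
    B = errors.sum()                           # sum of errors
    C = (errors * x).sum()                     # sum of errors * x
    if math.sqrt(B**2 + C**2) < tolerance:     # Steps 5 and 6: equation (1a)
        break
    w0 = w0 - 2 * learning_rate * B            # Step 3: equation (2)
    w1 = w1 - 2 * learning_rate * C            # Step 4: equation (3)

print(iteration, round(w0, 3), round(w1, 3))
```

For these five points the closed-form least-squares solution is w0 = -1 and w1 = 5, so the loop should finish close to those values.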
