loading...

Help Me Understand This Vectorized Logic

rpalo profile image Ryan Palo ・2 min read

I'd like to get some help understanding vectorized operations on multi-dimensional arrays. Specifically, I've got a problem and some code that I think should work, but it's not working, and I'm sure it's because my thinking is wrong, but I can't figure out why.

Some caveats:

  • This is for some homework. I really don't want a plop of code that I'm supposed to copy/paste without understanding. If I wanted that, I'd go to StackOverflow. I want the concepts.
  • I want to do this only using numpy. I know that scipy and other ML libraries have fancy functions that would do what I'm asking about in a black box, but that's not what I want. This is a learning exercise.

Here's the scenario:

The Scenario

I've got two datasets of Iris data (yes, that Iris data)--a training set and a test set. Both sets have 4 columns of float values, and an associated vector of labels classifying each of the data points.

5.1,3.5,1.4,0.2,Iris-setosa
4.9,3,1.4,0.2,Iris-setosa
7,3.2,4.7,1.4,Iris-versicolor
...

We're doing a 1-Nearest-Neighbor classification. The goal is to do the following:

  • For each data point in the testing set, compare it to all the points in the training set by calculating the "distance" between the two. Distance is calculated as
def distance(x, y):
    return math.sqrt((x[0] - y[0])**2 + (x[1] - y[1])**2 + (x[2] - y[2])**2 + (x[3] - y[3])**2)

Also known as the Root-Sum-Square of the differences between each feature of each point.

Now. Here's what I have right now:

My Code

import numpy as np

def distance(x, y):
    return np.sqrt(np.sum((x - y)**2, axis=1))


def main():
    # ... blah blah load data
    # training_data is 75 rows x 4 cols of floats
    # testing_data is 75 rows x 4 cols of floats
    # training_labels is 75 rows x 1 col of strings
    # testing_labels is 75 rows x 1 col of strings

    # My thought is to use "broadcasting" to do it without loops
    # so far, to me, "broadcasting" == "magic"

    training_data = training_data.reshape((1, 4, 75))
    testing_data = testing_data.reshape((75, 4, 1))

    # So this next bit should work like magic, producing a 75 x 1 x 75 matrix of
    # distances between the testing data (row indices) and the training data
    # (column indices)

    distances = distance(testing_data, training_data)

    # And the column index of the minimum distance should in theory be the 
    # index of the training point that is the "closest" to the given testing point
    # for that row

    closest_indices = distances.argmin(axis=1)

    # And this should build an array of labels corresponding to the indices
    # gathered above
    predicted_labels = training_labels[closest_indices]

    number_correct = np.sum(predicted_labels == testing_labels)
    accuracy = number_correct/len(testing_labels)

And this all seems right to me.

But.

When I run it, per my prompt, I should be expecting an accuracy somewhere in the .94 range, and I'm getting something in the .33 range. Which is poop.

So. What am I missing? What key concepts am I totally misunderstanding?

Thank you!

Discussion

pic
Editor guide
Collapse
evanoman profile image
Evan Oman

Without getting into the weeds, here are my high-level thoughts:

Given the input data and approach, are you sure that you should be able to get accuracy in the 0.94 range? Does the exercise indicate you should be able to get results like this using the algorithm approach you tried?

This ☝️ point brings up a second important point: it would be useful for you to decompose the problem into its key components and check each of those components separately. Are you sure the loading and broadcasting steps are working like you think it should? Have you tried your distance function on some simple data and got the result you expected? Finally, if all of the above are working as expected, could you put together some data which should receive 100% accuracy, and then test that out?

By looking at each piece individually (ideally with a light-weight unit test) you can start to narrow down the list of potential causes.

Collapse
rpalo profile image
Ryan Palo Author

Thanks for your help. I had been putting off tests, but that was the next step in the plan.

Actually, running through a much simpler case in a REPL ended up doing it for me. See my comment about my solution.

But your comments about going back to debugging basics and slowly and methodically validating one piece of logic at a time were what put me back on the right track, so thanks!

Collapse
evanoman profile image
Evan Oman

Glad you were able to figure it out, nice work!

Collapse
rpalo profile image
Ryan Palo Author

I've got an article in the works that walks through this in more detail, but I wanted to post my solution in case somebody ran into the same or similar issue.

The main problem I was having was using the reshape method is the wrong one. It would give me the right dimensions, but it jumbled up all of the individual numbers and didn't keep the "records" together.

After doing some experimenting in a REPL with simpler cases, I discovered that what I really wanted was swapaxis. This keeps the numbers in the correct order, but allows you to pivot an array into other dimensions (e.g. roll your 2D array into a plane in the third dimension).

So what I ended up with is:

def vectorized_distance(vec1, vec2):
    return np.sqrt(np.sum((vec2 - vec1)**2, axis=1))


def nearest_neighbor_classify(data, neighbors, labels):
    # Reshape data so that broadcasting works right
    data_rows, data_cols = data.shape
    data = data.reshape((data_rows, data_cols, 1))
    neighbor_rows, neighbor_cols = neighbors.shape
    flipped_neighbors = np.swapaxes(
        neighbors.reshape((neighbor_rows, neighbor_cols, 1)),
        0, 2)

    # Now data should be (n x m x 1) and flipped_neighbors (1 x m x n)
    # Broadcasting should produce an (n x m x n) array, but `np.sum` will 
    # squash axis 1 so we get a (n x n) point-to-point distance matrix

    distances = vectorized_distance(data, flipped_neighbors)

    # The index of the smallest value for each row is the index of the prediction
    closest_neighbor_indices = distances.argmin(axis=1)

    return labels[closest_neighbor_indices]
Collapse
evanoman profile image
Evan Oman

Ah, is this column-major vs row-major ordering issue?

Collapse
rpalo profile image
Ryan Palo Author

Yeah or maybe the multidimensional version of that, although I tried numpy’s different ordering strategies and none seemed to work quite right.