DEV Community

Georgi Serev for This Dot

Posted on

Implementing a primitive OCR using k-NN

In this article, we are going to implement a really primitive optical character recognition using the k-nearest neighbor classification algorithm. Our language of choice will be JavaScript. Before we move on, we will take a look at what k-NN is, and how it actually works.

k-NN—quick introduction

Let's imagine that we have a forest with three animal species: bears, wolves and foxes. Now consider that we are wildlife researchers who have information about the position of every animal in this forest via GPS trackers. Our data shows that the different species occupy different areas of the forest.

However, one day, our low-quality thermal cameras detect an unknown animal at coordinates M and N in that forest. We should classify that animal.

Hope you liked the short story, but it's time for us to look at the data we have. We will represent the forest as a 2-dimensional Euclidean space:

The forest

Seeing the diagram, you might say "Well, the closest animal is a bear, so it must be a bear" and you won't be exactly wrong. But, what if we take the two closest animals, the bear and the wolf? In that case, we can't say for sure what the unknown animal is. What about three? Then it's most likely a wolf.

You probably get where we are going. k-NN, or as its name says, "nearest neighbor," determines which are the k closest neighbors to the object we are attempting to classify. In the case of k = 1 we are actually performing nearest neighbor search which is a special case of k-NN. k = 2 results in an ambiguous output*. However, when k is 3, we receive a satisfactory result. So, as you might have guessed, picking an appropriate k is important for the accuracy of the algorithm.

* In case we have an even k, and ambiguous result, we are comparing the distances of the k-closest neighbors. This is the so-called "modified k-NN." That's why, it is advised to pick an odd k when using a conventional k-NN.

OCR implementation

Now that we know what k-NN is, and how it works, we can focus on our task, which is implementing an OCR. Keep in mind that this is not a production-quality one, nor is it super efficient, but it should give us a good perception of the capabilities of k-NN. Let's start with preparing our training data.

Training data

Just like we had the coordinates of the animals from the forest, we will need some example data for our OCR. This data is called the training data, and since k-NN is a supervised algorithm, it will need it in order to analyze, and classify, the input we provide.

Important note: In reality, k-NN is not really trained unlike other supervised algorithms where we build a model. Because of this, we will refer to "train" or "training" as the process of supplying example data to our algorithm, which is then used during runtime for classification.

For our OCR, we will introduce only the lowercase letters "a," "b," and "c," and will have 4 versions for each. This is a really small set, but it should work relatively well for our demo. Respectively, the larger the training data is, the more accurate results you might expect.

Training set

Every letter is comprised of 20 dots, which have x and y in the form:

[{ x: 0, y: 1 }, { x: 2, y: 4 }, ... ]
Enter fullscreen mode Exit fullscreen mode

You can check the full data at GitHub.

Okay, we should be good here. Let's move on.

The application

For the purposes of our OCR, we will need a simple application for testing. We will create a 250 by 250 px plane in which we will render every click as a dark blue dot. Respectively, these dots will represent the points that compose a letter.

Note: The training data used for the algorithm was created using it.

The app

I won't go into details how to implement this app as it is simple—and more importantly—since it's not the subject of this article. You can head to the GitHub repository, and check out the code. It's nothing more than a few div-s, buttons and some event listeners attached to them.

GitHub repo files:

The algorithm

Nice, we reached the cool part of this article. I'll presume that you have taken a look at the app's code already, so we can begin our algorithm with creating a new class named OcrKNN:

export class OcrKNN {
  constructor(k, trainingData) {
    this.__k = k;
    this.train(trainingData);
  }

  test(data) {
    // todo
  }

  train(trainingData) {
    // todo
  }
}
Enter fullscreen mode Exit fullscreen mode

We'll create two methods: test will be used for testing an input data and determining its class (i.e. classification) whereas train will load our k-NN instance with the training/example data needed for the classification. As you can see, we are calling this method in our constructor, where we also pass the k value. Let's start with the implementation of the train method since it is a prerequisite for the testing (obviously). In it, we will perform the data formatting.

Data preparation and formatting

If you already took a look at the format of our training data, you would know that it is kept like this:

{ a: [Array, Array, ...], b: [Array, Array, ...], ... }
Enter fullscreen mode Exit fullscreen mode

However, in order to make our k-NN function as we want, we will have to convert this data so that it is easier to process (and will accommodate for some scenarios we will later explain). In our case, we are going to perform 3 operations:

  1. Sorting
  2. Normalizing
  3. Flattening
1. Sorting

Imagine that we have two 2-point uppercase "I"-s. They are composed like this:

First "I":

[
  { x: 10, y: 5 },
  { x: 10, y: 20 }
]
Enter fullscreen mode Exit fullscreen mode

Second "I":

[
  { x: 10, y: 20 },
  { x: 10, y: 5 }
]
Enter fullscreen mode Exit fullscreen mode

Obviously, they should be the same, but as you can see, the order of the points is different. You'll probably ask "Does it matter?" and the answer will be "In our case, yes." Since we are going to be calculating distances later in our code, an incorrect order can result in an inaccurate output. In case that's not clear now, I'll elaborate later on.

So, for that matter, we will introduce the __sort method:

export class OcrKNN {
  // ...

  __sort(data) {
    return data.slice().sort((a, b) => {
      const xDiff = a.x - b.x;
      if (xDiff !== 0) {
        return xDiff;
      }
      return a.y - b.y;
    });
  }
}
Enter fullscreen mode Exit fullscreen mode

In short: it sorts the points in ascending order where the primary criteria is x and the secondary is y (i.e. if the x-s are equal, sort by y).

2. Normalization

Moving on to the normalization. Here we will take care of two potential problems that may occur during input—the position, and the size of the letter relative to our plane. First, let's tackle the position.

Our script should be able to distinguish a letter input regardless, if it was typed in the upper-left or lower-right corner of our plane. What we are going to do is to find the smallest x and y (mx and my) and then subtract them from the coordinates of every point. Hopefully, this graphical representation of the problem should give you an idea what the operation does:

Fixing the offset

Next, we will handle the different sizes of the letters. In similar fashion, we will take the biggest x and y from the dataset, but this time we will divide every point by it rather than subtracting. After this operation, we should end up with values between 0 and 1. This will be extremely helpful since now, we won't care about the actual pixels/positions, but for the ratios between the dots relative to 1. Therefore, a small and a large lowercase "a"-s will be virtually the same for our algorithm as long as the ratios between the dots are the same!

Ratio

All of this can be incorporated in the __normalize method:

export class OcrKNN {
  // ...

  __normalize(data) {
    const xs = data.map(l => l.x);
    const ys = data.map(l => l.y);
    const offsetX = Math.min(...xs);
    const offsetY = Math.min(...ys);
    const maxX = Math.max(...xs) - offsetX;
    const maxY = Math.max(...ys) - offsetY;

    return data.map((l) => ({
      x: (l.x - offsetX) / maxX,
      y: (l.y - offsetY) / maxY
    }));
  }
}
Enter fullscreen mode Exit fullscreen mode
3. Flattening

The final step of our data preparation will be the flattening. What we want to achieve is a single array with all of the points in the following format:

//  x1   y1    x2   y2       x3  y3
[    0, 0.1,    1, 0.5,    0.75,  0, ... ]
Enter fullscreen mode Exit fullscreen mode

I'll explain why we need this transformation later on. For now, let's just focus on the implementation of the flattening represented by yet another method called __flatten (for your amazement):

export class OcrKNN {
  // ...

  __flatten(data) {
    return data.reduce((arr, point) => {
      arr.push(point.x, point.y);
      return arr;
    }, []);
  }
}
Enter fullscreen mode Exit fullscreen mode

In the end, we will compose these methods in __format:

export class OcrKNN {
  // ...

  __format(data) {
    data = this.__sort(data);
    data = this.__normalize(data);
    return this.__flatten(data);
  }
}
Enter fullscreen mode Exit fullscreen mode

Simple, isn't it?

Finalize training process implementation

So far, so good. What's left is to go through the passed training set, and use the power of __format to make our data nice and tidy for the calculations we are going to perform in the next section of the article.

You should be aware of the form of our training data by now. We will create a new property named __trainingData which is an array in our OcrKNN class. In it, we will push every letter from the provided data. Once again, we are aiming for a flatter structure. The output should look like this:

[
  { clss: 'a', data: [ 0, 0.1, 1, ... ] },
  { clss: 'a', data: [ 0, 0.1, 1, ... ] },
  { clss: 'a', data: [ 0, 0.1, 1, ... ] },
  { clss: 'b', data: [ 0, 0.1, 1, ... ] },
  { clss: 'b', data: [ 0, 0.1, 1, ... ] },
  ...
]
Enter fullscreen mode Exit fullscreen mode

And the method implementation:

export class OcrKNN {
  // ...

  train(trainingData) {
    this.__trainingData = [];

    // Go through every property of the training data object (i.e. "a", "b", etc.)
    Object.keys(trainingData).forEach((clss) => {
      // Iterate through every test letter from the current class
      trainingData[clss].forEach((l) => {
        // Format the [{ x, y }, ...] letters
        // to a flat array of [0, 0.1, 1, ...]
        // and then push it to the training set
        // in a { className, flatArray } form
        this.__trainingData.push({
          clss,
          data: this.__format(l)
        });
      });
    });
  }
}
Enter fullscreen mode Exit fullscreen mode

Note: clss means "class" but since it is a keyword in JavaScript, we will use the version without vowels.

Calculating the distances

It's this part of the article that should clear a lot of things up for you. We already implemented the train method, so we are left only with the testing part, where most of the "magic" happens.

Let's start with going back to our analytic geometry classes (if you haven't taken these, don't worry). In the beginning of our article, we talked about "Euclidean space". Now, considering that we have "distance" in the title of the section, mentioned "analytic geometry", and "Euclidean space", you might realize that what's next is introducing a formula ... and you'll be correct! We are going to use the Euclidean distance formula, which is:

2D distance formula

where p and q are the points between which we want to calculate the distance.

However, this formula won't really help us—we don't have two points or anything like that. Anyway, it was a good starting point. What we actually need is to go beyond the 2-dimensional space of these two dots. We need an n-dimensional space:

ND distance formula

where p and q can be represented as n-tuples.

At this point, you might be frightened, but you shouldn't be. Do you remember that our letters were composed from 20 points, and then we flattened this array, respectively, ending with an array that has 40 entries? Well, what we are going to work with is a 40-dimensional space. And, yes—you don't have to imagine it. We will have to calculate the distances from our input to every other letter in our 40-space in pursuit of the scalar values that will determine the output of this algorithm. Hopefully, at this point, the flattening part of the data preparation should make sense to you. Let's take a look at the code:

export class OcrKNN {
  // ...

  test(data) {
    // Format training data
    data = this.__format(data);
    const distances = [];

    // Iterate through every letter from the training set
    this.__trainingData.forEach((l) => {
      let sum = 0;

      // Calculate the distance via the Euclidean distance formula
      // Note: having similar dot order is crucial
      // for the outcome of this calculation hence
      // why we sorted the data!
      for (let i = 0; i < data.length; i += 1) {
        sum += (data[i] - l.data[i]) * (data[i] - l.data[i]);
      }

      // Push the calculated distance
      distances.push({
        clss: l.clss,
        dist: Math.sqrt(sum)
      });
    });

    // ...
  }
}
Enter fullscreen mode Exit fullscreen mode

It's apparent that the first step is to format our input/test data just like we did with our training data. After that, we are just iterating though all available example letters and calculating the distance of the test letter we want to classify. In the end, the distances array should contain all distances with their respective class. The last step is to aggregate this data so that we find the k nearest neighbors.

export class OcrKNN {
  // ...

  test(data) {
    // ...

    return distances
      .sort((a, b) => a.dist - b.dist) // Sort the distances in DESC order
      .map((d) => d.clss) // Map the output to an array with class names only
      .slice(0, this.__k) // Take the first K elements
      .reduce((map, lett) => { // Create a map in the format [[CLASS_NAME, OCCURRENCES], ...]
        let added = false;
        for (let i = 0; i < map.length; i += 1) {
          if (map[i][0] === lett) {
            map[i][1] += 1;
            added = true;
          }
        }
        if (!added) {
          map.push([lett, 1]);
        }
        return map;
      }, [])
      .sort((a, b) => b[1] - a[1]) // Sort the map by occurrence number in DESC order
      .shift() // Get the first map element
      .shift(); // Return the key of the element (i.e. the class)
  }
}
Enter fullscreen mode Exit fullscreen mode

We are done with the algorithm!

Tying it all together

Let's move back to our app; we would like to create an instance of OcrKNN, set a k, provide training/example data for classification, and finally create a test letter for classification. Let's use a <button id="test"> in order to trigger the k-NN and a <div id="result"> where we can show the result:

import { Letters } from './letters.js';

const K = 3;
const data = []; // Array that contains the user input (i.e. dots/points of the test letter)

function initTestBtn() {
  const knn = new OcrKNN(K, Letters);

  document.getElementById('test')
    .addEventListener('click', () => {
      const result = knn.test(dots);
      resultEl.innerText = `The letter is "${result}"`;
    });
}
Enter fullscreen mode Exit fullscreen mode

Due to the small number of example letters that we have, we are going to pick a small odd k. In our case, 3 should do the job.

The only remaining thing now is to test our completed app!

The app

We should expect relatively correct test output. Don't be amazed, however, if your letter is recognized as a different one. In my experience, the letter "c" is sometimes confused for an "a". Anyway, as we said earlier, we would need a significantly larger training dataset (along with a good k) in order to improve and granulate the accuracy of our algorithm.

All of the code used in this article can be found on GitHub.

Conclusion

Hopefully, this example of a primitive OCR gave you a perspective on how k-NN could be used in practice. However, as you might've guessed, the major downside of this classification algorithm is the potentially weak performance and efficiency—we are forced to calculate all distances in order to classify an object, which might be a slow process when our training/example dataset grows. Still, its simplicity makes it a great tool when used appropriately!

This Dot Inc. is a consulting company which contains two branches : the media stream and labs stream. This Dot Media is the portion responsible for keeping developers up to date with advancements in the web platform. In order to inform authors of new releases or changes made to frameworks/libraries, events are hosted, and videos, articles, & podcasts are published. Meanwhile, This Dot Labs provides teams with web platform expertise using methods such as mentoring and training.

Top comments (1)

Collapse
 
anuragvohraec profile image
Anurag Vohra

wow simply wow.
thank you so much for this detailed explanation.