The first step in diving into any new project is creating a mental list of steps to take to gradually build the first version. After the initial prototype is done, polishing and finalizing it is pretty easy (as long as you don't rework any foundational components). I had just about no knowledge of computer vision algorithms prior to creating my document scanner, so I started with a high-level plan and broke each step into multiple smaller tasks that I could tackle one at a time. I thought the process would go something like this:

- Get an image containing a document from the user
- Find the document in the image
- Transform the perspective so that the document fills the full rectangular region of a new image

If you saw the first part of this series, you'll remember how we visualized these steps.

With this plan in mind, I began my research. As I would soon discover, these steps vary dramatically in their difficulty. Step 1 is trivial, and I had a working image selection UI by the end of my first day working on the project. Step 3 is complex but relatively straightforward: this excellent Stack Exchange answer even provided a rudimentary implementation of perspective transformation in JavaScript, which I would modify lightly to use in my prototype. However, step 2 is incredibly difficult and needs to be broken down into several smaller components.

Initially, I thought the easiest way to find a document in an image would be to find the four most corner-like points in the image and take those to be the corners of the actual document (which I assumed to be a rectangle). This led me down a wild-goose-chase involving Harris corner detection and contour detection, but after finding no success in my hacked-together implementations, I tried researching on a higher level.

I eventually found this post from Dropbox, which gave me an overview of the current state-of-the-art techniques for document detection. Instead of searching for four corners, my program would find all the edges in the image, then look for the four of them most likely to be the edges of the document. More specifically, I would need to devise a scoring function to rank all the combinations of four edges and use the combination with the highest score in my perspective transformation code.

I devised a few improvements over Dropbox's techniques. They used the Canny edge detection algorithm to create a visual representation of the edge-like regions in the image, then applied a Hough transform to that output to find the mathematical representations of the most likely edges in the image.

Instead, I opted to use only the first step of Canny, the Sobel operator, and the gradient direction it generated (which is usually treated as a side-effect) to reduce the number of votes in Hough space. This change dramatically improves performance (I'm estimating by 5x or more) and reduces the amount of noise that appears in the lines detected via the Hough transformation.

Dropbox also checked all combinations of four edges, including those that were geometrically impossible to be a document (for instance, where two "sides" of the paper cross each other and make an hourglass shape instead of a quadrilateral) and filtered out those impossible shapes afterwards. I only considered each combination of four lines that made a valid quadrilateral, which also improves performance a bit, but more importantly makes it easier to design an appropriate scoring function by reducing the scope of the input it has to deal with.

Lastly, I opted to downscale the images before applying all of these algorithms because doing so reduces the chance that text inside the document causes issues during edge detection, and because it improves performance quadratically with respect to the scaling factor while having a theoretical maximum impact of the scaling factor on the location of each edge. In simpler terms, reducing the width and height of the image by 5x would improve performance by 25x but at worst would cause the edges detected to be offset by 5 pixels compared to their true locations, and when the input images are usually at least 1080p, that small error is not noticeable in the final image after projective transformation.

After finishing my research, my revised plan was as follows:

- Get an image containing a document from the user
- Find the document in the image
- Convert the image into a downscaled, grayscale version
- Apply Gaussian blur to reduce noise
- Use the Sobel operator to find the gradient magnitude and direction at each pixel
- Use the Hough transformation to find the score for each possible line that passes through the image. Bucket the angle of each line into roughly 1 degree increments from 0 to 180 degrees, and the position into 2 pixel increments from the negative to the positive value of the hypotenuse of the dimensions of the image
- Use the gradient direction from the Sobel operator to add more weight in the Hough transformation to edges nearly orthogonal to the gradient at each pixel
- Find the top few thousand lines in the Hough transformation and apply non-maximum suppression to find a few dozen lines that have the greatest final score
- Sift through every combination of four lines that make valid quadrilaterals and apply a heuristic scoring function to find the candidate most likely to be the document
- Find the intersections of the lines in the best candidate to find the four corners of the document

- Use a projective transformation to warp the perspective of the original photo into the final image
- Compute a projective transformation: use some matrix algebra to solve linear equations that map the coordinates of the corners of the document into basis vectors representing homogeneous coordinates
- Do the same in reverse to map the homogeneous coordinates to 2D coordinates onto a flat, rectangular plane representing the document from a head-on view (and thus the final image)
- Iterate over every destination coordinate in the projected image and find the source coordinate from the original RGB image (which will likely consist of decimals and not integers)
- Use bilinear interpolation to simulate the pixel values at the decimal source coordinates and use those values at the destination coordinates to construct the projected image

If some of that flew over your head, don't worry; I'm only writing this description after I've finished the project and have struggled through the math behind each of these algorithms. We'll go into more depth about how each step works in the next article, starting with the Sobel operator.

## Discussion (0)