Scanned Documents Classification using Machine Learning

#machinelearning #ocr #python

First off, I am a web developer who has recently started exploring the machine learning domain.

So I am looking for some help, starting points, or guidelines on how to implement a machine learning based classifier for scanned documents/images that predicts which of 29 categories a document falls into.

The documents are mostly letters, memos, and reports (with tabular data). So far, I have found Tesseract OCR and OpenCV, which I think will be the tools needed for this task. I also think I will need some kind of NLP technique to extract the meaning and predict better. However, it would be great if someone could dumb down the strategy and route to take for this. What are some of the specific techniques/skills/tools/packages I need to learn? Since the scanned images are of varying quality, what image processing techniques can I employ to get the best results?
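
To make the question concrete, here is a minimal sketch of one possible pipeline: clean up the scan with OpenCV, read the text with Tesseract, then classify the text. It is an illustration under assumptions, not a recommended design. `ocr_scanned_page`, `tokenize`, and `NaiveBayesTextClassifier` are hypothetical names; the median blur plus Otsu threshold is just one common cleanup choice; the tiny training set is made up; and the hand-written Naive Bayes stands in for whatever NLP model ends up being used. The OCR step assumes `opencv-python`, `pytesseract`, and the Tesseract binary are installed.

```python
import math
import re
from collections import Counter, defaultdict


def ocr_scanned_page(image_path):
    """Clean up a scan with OpenCV, then read its text with Tesseract.

    Assumes opencv-python and pytesseract are installed and the Tesseract
    binary is on PATH. Imported lazily so the classifier below can be
    tried without them.
    """
    import cv2
    import pytesseract

    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise FileNotFoundError(image_path)
    # Median blur removes speckle noise; Otsu picks a global threshold.
    gray = cv2.medianBlur(gray, 3)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return pytesseract.image_to_string(binary)


def tokenize(text):
    """Lowercase the text and keep alphabetic words of 2+ letters."""
    return re.findall(r"[a-z]{2,}", text.lower())


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocabulary = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up OCR output standing in for real scanned pages.
train_texts = [
    "Dear Sir, thank you for your letter. Yours sincerely",
    "Dear Madam, I am writing regarding your letter. Yours faithfully",
    "MEMO To: all staff From: management Subject: office hours",
    "Memorandum To: sales team From: director Subject: targets",
    "Quarterly report: total revenue table, summary of results",
    "Annual report with figures, table of totals and summary",
]
train_labels = ["letter", "letter", "memo", "memo", "report", "report"]

clf = NaiveBayesTextClassifier().fit(train_texts, train_labels)
print(clf.predict("Subject: staff meeting From: management"))
```

With real data, the training texts would come from calling `ocr_scanned_page` on each labeled scan, and the three labels here would be replaced by the 29 categories.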

Top comments (3)

Vesi Staneva • Jul 3 '20

My team just completed an open-source Content Moderation Service built with Node.js, TensorFlowJS, and ReactJS that we have been working on over the past weeks. We have now released the first part of a series of three tutorials, How to create an NSFW Image Classification REST API, which might help you answer some of those questions. Any comments and suggestions are more than welcome. Thanks in advance!
(Fork it on GitHub or click 🌟 star to support us and stay connected 🙌)

rjs417 • Feb 6 '19

I'm also looking for the same.
I'll follow this post.

Gyandeep Singh • Sep 29 '18

I am also looking for some insight into this.
Thanks for posting this question.