First off, I am a web developer who has recently started exploring the machine learning domain.
I am looking for help, starting points, or guidelines on how to implement a machine-learning-based classifier for scanned documents/images that predicts which of 29 categories a document falls into.
The documents are mostly letters, memos, and reports (with tabular data). So far, I have found Tesseract OCR and OpenCV, which I think will be the tools needed for this task. I also think I will need some kind of NLP technique to extract the meaning of the text and improve the predictions. However, it would be great if someone could explain, in simple terms, the strategy and route to take. What specific techniques, skills, tools, or packages do I need to learn? And since the scanned images are of varying quality, what image processing techniques can I employ to get the best results?
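To make the question more concrete, here is a minimal sketch of the text-classification half of the pipeline as I currently picture it. It assumes the text has already been extracted from each scan (for example by Tesseract) and only shows one possible baseline, a bag-of-words multinomial Naive Bayes classifier written with the Python standard library. The class name `NaiveBayesTextClassifier`, the `tokenize` helper, and the three toy documents and labels are all made up for illustration; the real task would have 29 categories and far more training data, and a different model may well work better.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens of 2+ letters.

    This crude filter also drops a lot of typical OCR noise (stray symbols, digits).
    """
    return re.findall(r"[a-z]{2,}", text.lower())


class NaiveBayesTextClassifier:
    """Hypothetical baseline: multinomial Naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter(labels)        # label -> number of documents
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        self.total_docs = len(labels)
        return self

    def predict(self, text):
        # Words never seen in training carry no information, so skip them.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / self.total_docs)
            for t in tokens:
                score += math.log((counts[t] + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy stand-ins for OCR output; real input would come from the scanned pages.
train_texts = [
    "Dear Sir, thank you for your letter regarding the meeting. Sincerely yours",
    "Memo to all staff from management, subject: office hours",
    "Quarterly report: table of sales totals and revenue figures",
]
train_labels = ["letter", "memo", "report"]

clf = NaiveBayesTextClassifier().fit(train_texts, train_labels)
print(clf.predict("Dear Madam, sincerely yours"))
print(clf.predict("Memo to staff about office hours"))
```

What I am unsure about is everything around this step: how to clean up the scans before OCR, and whether a simple word-count model like this is a reasonable starting point or whether I should be looking at something else entirely.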