Smarter document management using OCR

#beginners #python #computerscience #machinelearning

Many businesses (including my own) suffer from unproductive processes, such as manual data processing. These issues can often be solved through automation, using structured systems such as a CRM and custom tools. Over the years I've dealt with complex environments that require a lot of data processing, analysis and reporting, where "data" can mean anything that's digital.

Some time ago I worked with a client who had thousands of unstructured documents that had piled up over the years. It had become a very unproductive environment, especially when information had to be retrieved and couldn't be found efficiently. Fortunately, technology can help us. OCR stands for Optical Character Recognition, a technique, nowadays usually based on machine learning, for extracting text from images.

Suppose you have hundreds of files, and most of these are copies of passports, contracts and invoices. Some were photographed with a phone, some were scanned, and some are PDF files containing text and/or images. The demo screenshots below illustrate how we can extract text/keywords from these kinds of documents.

Using the extracted text/keywords we can process these files according to our own business rules, such as rename/copy/move/backup. We can also send or upload these files to another pipeline for further processing. Keep in mind that OCR is pretty good but not perfect: it works best when images are clear and don't contain unusual characters. Most languages are supported.
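Because OCR output is rarely perfect, an exact keyword check can miss a document when a single character is misread (for example `0` instead of `O`). Here is a minimal sketch of a more tolerant check that uses only Python's standard `difflib` module. The function name `find_keywords`, the 0.8 similarity cutoff and the sample text are assumptions made for illustration; they are not part of PyCRM.

```python
import difflib
import re


def find_keywords(text, keywords, cutoff=0.8):
    """Return the single-word keywords found in `text`, tolerating small OCR errors."""
    # Upper-case the text and split it into alphanumeric tokens.
    tokens = re.findall(r"[A-Z0-9]+", text.upper())
    found = []
    for keyword in keywords:
        # get_close_matches keeps tokens whose similarity ratio is >= cutoff.
        if difflib.get_close_matches(keyword.upper(), tokens, n=1, cutoff=cutoff):
            found.append(keyword)
    return found


# "C0NTRACT" contains a zero, a typical OCR misread of the letter O.
sample = "C0NTRACT of employment, signed in Antwerp"
print(find_keywords(sample, ["CONTRACT", "INVOICE", "PASSPORT"]))
# → ['CONTRACT']
```

A lower cutoff catches more misreads but also produces more false positives, so the value is worth tuning on a handful of your own documents.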

# Basic usage of our OCR library

import ocr

your_file = './demo_files/doc1.pdf'
text = ocr.process(your_file)

# your business rules
if 'CONTRACT' in text:
  ...
else:
  ...

It's as easy as that: you only need basic Python knowledge to get started. For more information, visit our Git repository.

https://github.com/healzer/PyCRM
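The `...` placeholders in the basic usage example are where your own rules go. As one possible sketch, a rule could move each document into a folder based on the first keyword found in its text. The keyword-to-folder mapping `RULES` and the helpers `pick_folder` and `file_document` are hypothetical names chosen for this example and are not part of PyCRM.

```python
import shutil
from pathlib import Path

# Hypothetical rules: keyword found in the OCR text -> destination folder.
RULES = {
    "CONTRACT": "contracts",
    "INVOICE": "invoices",
    "PASSPORT": "passports",
}


def pick_folder(text, rules=RULES, default="unsorted"):
    """Return the folder of the first rule keyword found in the text."""
    upper = text.upper()
    for keyword, folder in rules.items():
        if keyword in upper:
            return folder
    return default


def file_document(path, text, archive_root):
    """Move `path` into archive_root/<folder> and return its new location."""
    target_dir = Path(archive_root) / pick_folder(text)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / Path(path).name
    # Note: a file with the same name already in the folder may be overwritten.
    shutil.move(str(path), str(target))
    return target
```

With the library from this post, this could presumably be called as `file_document(your_file, ocr.process(your_file), './archive')`, assuming `ocr.process` returns the extracted text as a string.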

The "PyCRM" project is a collection of useful tools, tips and tricks for your business. These can be used in almost any industry that has some digital processes: managing clients/data, data extraction & analysis, reports, process automation, etc.

Top comments (11)

sreenivas • Sep 21 '20

How good is this with grocery receipts? I have been planning to look into setting up a system to track my grocery receipts to further explore my spending habits.

Ilya Nevolin • Sep 21 '20

Hey, could you email me a few samples of your grocery receipts? I'll run them through the system for you :)

sreenivas • Sep 22 '20

I don't have one with me right now. I usually throw them away right at the store, but I will collect a few and send them to you next time.

Vincent • Dec 3 '21 • Edited

PixLab offers passport and ID card scanning capabilities using a state-of-the-art PP-OCR algorithm via its DOCSCAN REST API endpoint. You can find more information at blog.pixlab.io/2020/06/passport-do...