DEV Community 👩‍💻👨‍💻

Ilya Nevolin

Smarter document management using OCR

Many businesses (including my own) suffer from unproductive processes such as manual data processing. These issues can be solved through automation, using structured systems such as a CRM and custom tools. Over the years I've dealt with complex environments that require a lot of data processing, analysis and reporting. Here, "data" can mean anything that's digital.

Some time ago I worked with a client who had thousands of unstructured documents that had piled up over the years. It had become a very unproductive environment, especially when information had to be retrieved and couldn't be found efficiently. Fortunately, technology can help us. OCR stands for Optical Character Recognition: a field, closely tied to machine learning, that focuses on extracting text from images.

Suppose you have hundreds of files, most of them copies of passports, contracts and invoices. Some images were taken with a phone, some were scanned, and some are PDF files containing text and/or images. The demo screenshots below illustrate how we can extract text/keywords from these kinds of documents.

[Screenshot: OCR on a passport image]

[Screenshot: OCR on a PDF]

Using the extracted text/keywords, we can process these files according to our own business rules, such as renaming, copying, moving or backing them up; we can also send or upload them to another pipeline for further processing. Keep in mind that OCR is pretty good but not perfect: it works best when images are clear and don't contain unusual characters. Most languages are supported.
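
Because OCR output can contain misread characters, an exact keyword check may miss a document it should catch. Below is a minimal sketch of a more tolerant check using Python's standard `difflib` module. The helper name `contains_keyword` and the `cutoff` value are assumptions chosen for illustration; they are not part of the OCR library described in this post.

```python
import difflib
import re


def contains_keyword(text, keyword, cutoff=0.8):
    """Return True if any word in `text` closely resembles `keyword`.

    Hypothetical helper: compares case-insensitively and tolerates a few
    misread characters, which can happen in OCR output.
    """
    # Split the text into upper-cased words (letters, digits, underscores).
    words = re.findall(r"\w+", text.upper())
    # get_close_matches returns the words whose similarity ratio >= cutoff.
    matches = difflib.get_close_matches(keyword.upper(), words, n=1, cutoff=cutoff)
    return bool(matches)


sample = "EMPLOYMENT C0NTRACT between ..."  # "0" misread in place of "O"
print(contains_keyword(sample, "CONTRACT"))
print(contains_keyword(sample, "INVOICE"))
```

A lower `cutoff` accepts more distorted words but also raises the chance of false matches, so the value is worth tuning on your own documents.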

# Basic usage of our OCR library

import ocr

your_file = './demo_files/doc1.pdf'
text = ocr.process(your_file)

# your business rules
if 'CONTRACT' in text:
  ...
else:
  ...

It's as easy as that: you only need basic Python knowledge to get started. For more information, visit our Git repository.

https://github.com/healzer/PyCRM
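
The business rules mentioned earlier (rename, copy, move, back up) could look roughly like the sketch below, which moves a file into a folder chosen from keywords in its extracted text. The `RULES` table, the folder names and the functions `classify` and `route_file` are hypothetical and only for illustration; the demo uses an empty placeholder file and a hard-coded string instead of real OCR output.

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical rules: keyword found in the text -> destination folder name.
RULES = {
    "PASSPORT": "passports",
    "CONTRACT": "contracts",
    "INVOICE": "invoices",
}


def classify(text, default="unsorted"):
    """Return the folder for the first rule whose keyword appears in `text`."""
    upper = text.upper()
    for keyword, folder in RULES.items():
        if keyword in upper:
            return folder
    return default


def route_file(path, text, base_dir):
    """Move `path` into base_dir/<folder>, chosen from the extracted text."""
    target_dir = Path(base_dir) / classify(text)
    target_dir.mkdir(parents=True, exist_ok=True)
    destination = target_dir / Path(path).name
    shutil.move(str(path), str(destination))
    return destination


# Demo on a throwaway file; in practice `text` would come from OCR,
# e.g. text = ocr.process(path) as in the snippet above.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "doc1.pdf"
    source.write_bytes(b"")  # placeholder file, not a real PDF
    moved = route_file(source, "Service contract no. 17", tmp)
    print(moved.relative_to(tmp))
```

Keeping the rules in a plain dictionary means new document types can be added without touching the routing logic. Rules are checked in insertion order, so the first matching keyword wins.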

The "PyCRM" project is a collection of useful tools, tips and tricks for your business. These can be used in almost any industry that has some digital processes: managing clients/data, data extraction & analysis, reports, process automation, etc.

Top comments (11)

sreenivas

How good is this with grocery receipts? I have been planning to set up a system to track my grocery receipts so I can explore my spending habits.

Ilya Nevolin Author

Hey, could you email me a few samples of your grocery receipts? I'll run them through the system for you :)

sreenivas

I don't have one with me right now. I usually throw them away right at the store; I'll collect a few and send them to you next time.

Vincent

PixLab offers passport and ID card scanning using the PP-OCR algorithm via its DOCSCAN REST API endpoint. You can find more information at blog.pixlab.io/2020/06/passport-do...

Nathaniel

Good stuff. However, I think you should hide the passport details.

Ilya Nevolin Author

It's a random passport image I found on Google, so no harm, I guess.

ogoh cyril

Thought as much

Mohd Talha

Interesting Stuff

biteniumexchange

Does it collect the image and all the text from a passport and a local ID card?

Greg, The JavaScript Whisperer

Eeeeek don't provide your passport photo online!
Identity theft is a big issue.

Greg, The JavaScript Whisperer

Saw it's someone else's... probably still not good to spread it around? Otherwise, interesting article and cool tech.
