Many businesses (including my own) suffer from unproductive processes such as manual data processing. These problems can be solved through automation, using structured systems such as a CRM and custom tools. Over the years I've worked in complex environments that require a lot of data processing, analysis and reporting. Here, "data" can mean anything digital.
Some time ago I worked with a client who had thousands of unstructured documents that had piled up over the years. This had become a very unproductive environment, especially when information had to be retrieved and couldn't be found efficiently. Fortunately, technology can help us. OCR stands for Optical Character Recognition: a technique, nowadays usually based on machine learning, for extracting text from images.
Suppose you have hundreds of files, most of them copies of passports, contracts and invoices. Some were photographed with a phone, some were scanned, and some are PDF files containing text and/or images. The demo screenshots below illustrate how we can extract text/keywords from these kinds of documents.
Using the extracted text/keywords, we can process these files according to our own business rules (rename, copy, move, back up), or send/upload them to another pipeline for further processing. Keep in mind that OCR is pretty good but not perfect: it works best when images are clear and don't contain unusual characters. Most languages are supported.
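Because OCR output is not perfect, an exact keyword check can miss a word in which a single character was misread (for example `C0NTRACT` instead of `CONTRACT`). The sketch below shows one possible way to make keyword checks more tolerant, using only the Python standard library. The helper names `normalize` and `contains_keyword` and the `cutoff` value of 0.8 are assumptions made for this illustration; they are not part of our OCR library.

```python
import difflib
import re
import unicodedata


def normalize(text: str) -> str:
    """Strip accents, collapse whitespace, and uppercase the text."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"\s+", " ", no_accents).strip().upper()


def contains_keyword(text: str, keyword: str, cutoff: float = 0.8) -> bool:
    """Return True if some word in the text is similar enough to the keyword.

    `cutoff` is a similarity threshold between 0 and 1 (hypothetical default);
    a higher value means stricter matching.
    """
    words = re.findall(r"[A-Z0-9]+", normalize(text))
    matches = difflib.get_close_matches(normalize(keyword), words, n=1, cutoff=cutoff)
    return bool(matches)


if __name__ == "__main__":
    sample = "This C0NTRACT is made between the parties"
    print(contains_keyword(sample, "contract"))
```

A lower `cutoff` tolerates more misread characters but also increases the chance of false matches, so it is worth tuning on a handful of real documents.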
    # Basic usage of our OCR library
    import ocr

    your_file = './demo_files/doc1.pdf'
    text = ocr.process(your_file)

    # your business rules
    if 'CONTRACT' in text:
        ...
    else:
        ...
It's as easy as that: you only need basic Python knowledge to get started. For more information visit our Git repository.
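To make the "business rules" step more concrete, here is a minimal sketch of sorting files into folders based on keywords found in their text. The rule table `RULES`, the folder names and the functions `pick_folder` and `sort_files` are hypothetical names chosen for this example. The text extractor is passed in as a parameter, so the sketch assumes nothing about the OCR library beyond a function that takes a file path and returns a string.

```python
import shutil
from pathlib import Path
from typing import Callable, Dict

# Hypothetical rules: keyword found in the text -> target subfolder.
RULES = {
    "CONTRACT": "contracts",
    "INVOICE": "invoices",
    "PASSPORT": "passports",
}


def pick_folder(text: str, default: str = "unsorted") -> str:
    """Return the subfolder of the first rule whose keyword occurs in the text."""
    upper = text.upper()
    for keyword, folder in RULES.items():
        if keyword in upper:
            return folder
    return default


def sort_files(
    source_dir: str,
    target_dir: str,
    extract_text: Callable[[str], str],
) -> Dict[str, str]:
    """Copy each file in source_dir into a keyword-based subfolder of target_dir.

    Returns a mapping of file name -> chosen subfolder. Files are copied,
    not moved, so the originals stay untouched.
    """
    result = {}
    for path in sorted(Path(source_dir).iterdir()):
        if not path.is_file():
            continue
        folder = pick_folder(extract_text(str(path)))
        destination = Path(target_dir) / folder
        destination.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, destination / path.name)
        result[path.name] = folder
    return result
```

Assuming `ocr.process` accepts a file path and returns the extracted text, as in the basic usage above, a call could look like `sort_files('./demo_files', './sorted', ocr.process)`. Copying rather than moving is a deliberately cautious choice here: a misread keyword then only produces a misplaced copy, never a lost original.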
The "PyCRM" project is a collection of useful tools, tips and tricks for your business. These can be used in almost any industry that has some digital processes: managing clients/data, data extraction & analysis, reports, process automation, etc.