Abto Software

Posted on Sep 19, 2022 • Edited on Jul 18, 2025

AI-driven document classification system for a legacy DMS

#community #ai #tutorial #programming

By combining OCR and NLP technology, our team delivered an AI document recognition system. The platform is used throughout the construction lifecycle and simplifies processes from planning and design to operation and maintenance.

Read on to get an idea of how we used our expertise to build a custom AI document management system, which now serves users in the United Kingdom, Ireland, Australia, and Qatar.

Brief overview

The client is a European company delivering technology to benefit construction and engineering businesses. The solutions they’re designing are used by architects, engineers, housebuilders, and contractors.

The client contacted our company to implement an automated AI document classification system for a legacy DMS. The cloud-based solution works on both mobile and desktop and is actively used to optimize operational processes in the construction industry.

Designing an AI-enabled document classification system: Step-by-step guide

Transferring documentation into a corporate DMS is tedious, nerve-racking work. Because the document management system is cumbersome to use, users often skip important steps, which creates risk.

The project followed several phases that reflect our approach to delivering AI automation.

Phase 0. Dataset analysis

In the first stage of the project, we examined the document types that had to be classified:

DOC and XLS files
PDF files
Scanned images (PNG, JPEG, and BMP)
AutoCAD drawings

We assigned three labels to each of the documents provided to us, each label containing from 3 to 18 classes. The dataset included around 14,000 documents per label, which is about 200 to 11,000 documents per class.

This approach is called multi-label document classification: each document carries more than one label. In this particular case, each record had three labels, and each label was assigned exactly one class.

Simple example:

A processed DOC file that contains information about furniture will get the labels “furniture”, “material”, and “description”
Each label has its own unique name within the corporate DMS (“type”, “category”, and “purpose”)
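To make the example concrete, here is a minimal sketch of how one labeled record could be represented. The label names come from the article; the `LabeledDocument` class, its `targets` helper, and the sample text are hypothetical and only illustrate the shape of the data.

```python
from dataclasses import dataclass

# Label names as used in the corporate DMS (from the article).
LABEL_NAMES = ("type", "category", "purpose")


@dataclass(frozen=True)
class LabeledDocument:
    """Hypothetical training record: extracted text plus one class per label."""
    text: str
    type: str
    category: str
    purpose: str

    def targets(self) -> dict[str, str]:
        # Map each label name to the class assigned to this document.
        return {name: getattr(self, name) for name in LABEL_NAMES}


# Illustrative record; the text and class values are made up.
doc = LabeledDocument(
    text="Oak office desk, lacquered finish",
    type="Furniture",
    category="Material",
    purpose="Description",
)
print(doc.targets())
```

Keeping the three labels as separate fields mirrors the setup described above: every document always has exactly three targets, one per label.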

Phase 1. Text vectorization

One of the most crucial steps in building a document classification system is text vectorization: converting the text extracted from each document into numeric vectors that a machine learning model can analyze and compare across documents.
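As an illustration, the sketch below vectorizes a tiny corpus with TF-IDF using scikit-learn. TF-IDF is one of the methods we investigated, but this is not necessarily the configuration used in production; the sample sentences and the `ngram_range` setting are assumptions made for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up sample texts standing in for text extracted from documents.
corpus = [
    "fire door specification oak veneer",
    "door hardware schedule hinges locks",
    "concrete slab structural drawing notes",
]

# Unigrams and bigrams; the setting is illustrative, not tuned.
vectorizer = TfidfVectorizer(lowercase=True, ngram_range=(1, 2))

# Sparse matrix with one row per document and one column per term.
matrix = vectorizer.fit_transform(corpus)

print(matrix.shape)
print(sorted(vectorizer.vocabulary_)[:5])
```

Each row of `matrix` is the numeric representation of one document, which is what a classifier consumes downstream.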

To design a custom Parser API, we used Tesseract OCR, which is trained to:

Accurately scan and process text records, schemes and even images
Automatically convert extracted information into readable, easy-to-understand formats
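The following is a minimal sketch of the OCR branch of such a parser, assuming the `pytesseract` and `Pillow` packages and a local Tesseract installation. The function names `extract_text` and `ocr_with_tesseract` are hypothetical, and extractors for DOC, XLS, PDF, and AutoCAD files are left out.

```python
from pathlib import Path
from typing import Callable

# Scanned image formats listed in the dataset analysis.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".bmp"}


def ocr_with_tesseract(path: Path) -> str:
    """Run Tesseract OCR on an image file.

    Imports are local so the module loads even when OCR is not installed.
    """
    import pytesseract
    from PIL import Image

    with Image.open(path) as image:
        return pytesseract.image_to_string(image)


def extract_text(path: Path, ocr: Callable[[Path], str] = ocr_with_tesseract) -> str:
    """Return the text of a document, dispatching on the file extension."""
    suffix = path.suffix.lower()
    if suffix in IMAGE_SUFFIXES:
        return ocr(path)
    # Other formats would need their own extractors (not shown here).
    raise ValueError(f"no extractor registered for {suffix!r}")
```

Passing the OCR step as a parameter keeps the dispatch logic testable without Tesseract, for example `extract_text(Path("scan.png"), ocr=lambda p: "stub text")`.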

Phase 2. Approach investigation

We investigated more than a dozen ML approaches, focusing on achieving high accuracy despite the unbalanced datasets.

We adopted ensemble learning, as it has proven to perform well on similarly unbalanced datasets. We trained three classifiers, one individual ML model for each assigned label. The models shared the same architecture, but each was trained to predict a different label.
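The idea of one model per label can be sketched with scikit-learn as follows. This is an illustration under assumptions, not the production setup: a random forest stands in as a generic ensemble method, TF-IDF stands in for the vectorizer, `class_weight="balanced"` is one common way to compensate for unbalanced classes, and the training data is a made-up toy set.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

LABEL_NAMES = ("type", "category", "purpose")


def build_model():
    """Same architecture for every label: vectorizer plus ensemble classifier."""
    return make_pipeline(
        TfidfVectorizer(),
        # "balanced" reweights classes inversely to their frequency.
        RandomForestClassifier(n_estimators=50, class_weight="balanced", random_state=0),
    )


def train_per_label(texts, targets):
    """Fit one independent model per label.

    `targets` maps each label name to a list of classes aligned with `texts`.
    """
    return {name: build_model().fit(texts, targets[name]) for name in LABEL_NAMES}


# Toy data, invented for the example.
texts = [
    "oak door with steel frame",
    "glass door installation guide",
    "oak window frame drawing",
    "glass window installation guide",
]
targets = {
    "type": ["Doors", "Doors", "Windows", "Windows"],
    "category": ["Material", "Process", "Material", "Process"],
    "purpose": ["Description", "Instruction", "Description", "Instruction"],
}

models = train_per_label(texts, targets)
print({name: model.predict(["oak door"])[0] for name, model in models.items()})
```

Training the labels separately means each model can be retrained or replaced without touching the other two.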

Phase 3. Algorithm implementation

We built a robust AI document classification API, which receives and converts the results from the Parser API. The classification API achieves 98% accuracy for classification within a single label and 96% accuracy for classification within all three labels.
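The two accuracy figures measure different things. The sketch below shows one way to compute them, assuming that accuracy "within all three labels" means a document counts as correct only when all three predicted classes match. The helper names and the four toy records are hypothetical.

```python
LABEL_NAMES = ("type", "category", "purpose")


def per_label_accuracy(y_true, y_pred, label):
    """Share of documents whose prediction for one label is correct."""
    hits = sum(t[label] == p[label] for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


def all_labels_accuracy(y_true, y_pred):
    """Share of documents where every label is predicted correctly."""
    hits = sum(
        all(t[name] == p[name] for name in LABEL_NAMES)
        for t, p in zip(y_true, y_pred)
    )
    return hits / len(y_true)


# Four toy documents with the same true labels.
truth = {"type": "Doors", "category": "Material", "purpose": "Description"}
y_true = [truth] * 4
y_pred = [
    dict(truth),                            # fully correct
    {**truth, "type": "Windows"},           # wrong type
    dict(truth),                            # fully correct
    {**truth, "purpose": "Instruction"},    # wrong purpose
]

print(per_label_accuracy(y_true, y_pred, "type"))   # 0.75
print(all_labels_accuracy(y_true, y_pred))          # 0.5
```

Because a single wrong label makes the whole document count as a miss, the all-labels figure can never exceed the lowest per-label figure.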

Document classification API output structure:

{
"type": "Class", "typeProbability": Accuracy,
"category": "Class", "categoryProbability": Accuracy,
"purpose": "Class", "purposeProbability": Accuracy
}

Document classification API output example:

{
"type": "Doors", "typeProbability": 0.99,
"category": "Material", "categoryProbability": 0.97,
"purpose": "Description", "purposeProbability": 0.98
}
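A response in this shape could be assembled from per-label class probabilities roughly as follows. This is a sketch: the `build_response` helper and the input format (a mapping from label name to class probabilities) are assumptions, and the numbers reuse the example above.

```python
import json


def build_response(probabilities):
    """Build the API payload from per-label class probabilities.

    `probabilities` maps label name -> {class name: probability}.
    """
    response = {}
    for label, class_probs in probabilities.items():
        # Pick the most probable class for this label.
        best_class = max(class_probs, key=class_probs.get)
        response[label] = best_class
        response[f"{label}Probability"] = round(class_probs[best_class], 2)
    return response


example = build_response({
    "type": {"Doors": 0.99, "Windows": 0.01},
    "category": {"Material": 0.97, "Process": 0.03},
    "purpose": {"Description": 0.98, "Instruction": 0.02},
})
print(json.dumps(example, indent=2))
```

Returning the probability next to each class lets the DMS decide, for instance, to ask the user for confirmation when a value is low.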

Phase 4. Deployment & data security

The custom AI document classification system is hosted on the Amazon Web Services (AWS) cloud platform. This decision provided multiple business benefits.

By utilizing Amazon Web Services (AWS), our team provided for:

Data security – the solution is compliant with the GDPR regulation
User experience – the client can manage user policies, monitor flows, and efficiently respond to security threats

The structure

The solution consists of two parts:

Parser API. This part performs preprocessing, text extraction, and vectorization
Classification API. This part performs categorization based on the output of the Parser API

The technology

Tech stack:

Python
Scikit-learn
Tesseract OCR
Amazon Web Services (AWS)

Investigated text vectorization methods:

Word2vec
FastText
GloVe
TF-IDF
Universal Sentence Encoder
BERT

Investigated text classification algorithms:

LSTM
GRU
Unidirectional RNN
Bidirectional RNN
SVM
KNN
XGBoost
AdaBoost
Logistic Regression
Decision Trees
Naïve Bayes methods (Gaussian Naïve Bayes, Multinomial Naïve Bayes, Categorical Naïve Bayes)

Final words

By implementing OCR and NLP technology, we delivered a custom, AI-enabled document classification service. The solution simplifies the construction lifecycle from planning to operation and maintenance.

The service, seamlessly integrated into a legacy DMS, provides for:

An uncomplicated user journey. Our solution automates routine data entry.
Extensive support. The solution processes both machine-readable documents and non-readable ones, such as scanned images.
Multilabel classification. The system performs classification within three different labels.
High accuracy. We achieved 98% accuracy within one single label along with 96% accuracy within all three labels.
Improved accessibility.
Great scalability.
GDPR compliance. The adopted development and deployment approaches provide for data security.
