Applying OCR in receipt reading to prevent fraud

Every day, Riders photograph and submit a large number of bills (or receipts), which are stored in our system as part of the Rider Operations process. These images are currently processed with OCR (optical character recognition), which detects the text regions in each uploaded image, annotates their bounding boxes, and extracts their content.

Applying OCR to bill processing has several advantages. On the technical side, OCR makes it possible to store and retrieve data systematically; on the organizational side, it significantly improves drivers’ compliance with our operating procedure.

So how did we implement OCR? Instead of building a model from scratch, the Data team used PaddleOCR, a toolkit that provides high-quality pretrained models that can be customized easily.

Why use a pretrained model instead of building one from scratch? Building from scratch comes with common challenges, such as small or imbalanced datasets, and designing a suitable model architecture is not easy. Therefore, our Data team took the pretrained models provided by PaddleOCR and customized them to improve accuracy, saving both cost and time.

Customizing a pretrained model in this way is called Transfer Learning: applying the knowledge a model has learned in one domain to a new, related domain. With Transfer Learning, the Data team avoided a key pitfall of training a model in isolation, namely that previously learned knowledge is not retained, accumulated, or inherited. Transfer Learning helps the model learn faster and with less training data, start from a higher initial performance (a higher start), and reach higher accuracy.

In this particular problem, the task has two steps:

Extract the strings in the photograph
Check if the last 4 digits match the code recorded on the system
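
The matching step can be sketched in a few lines of Python. This is only an illustration, not our production code: it assumes the OCR stage has already returned the recognized strings as a plain list, and that a receipt "matches" when some digit run on it ends with the last 4 digits of the code stored in the system. The function names `extract_digit_runs` and `last4_matches` are hypothetical.

```python
import re


def extract_digit_runs(texts):
    """Return every run of ASCII digits found in the recognized strings."""
    runs = []
    for text in texts:
        runs.extend(re.findall(r"[0-9]+", text))
    return runs


def last4_matches(texts, system_code):
    """Check whether any digit run on the receipt ends with the
    last 4 digits of the code recorded on the system.

    Assumption: non-digit characters in system_code are ignored, and a
    code with fewer than 4 digits never matches.
    """
    expected = re.sub(r"[^0-9]", "", system_code)[-4:]
    if len(expected) < 4:
        return False
    return any(
        run.endswith(expected)
        for run in extract_digit_runs(texts)
        if len(run) >= 4
    )


# Hypothetical OCR output for one receipt photo.
recognized = ["Order #A-20231587", "Total 45,000"]
print(last4_matches(recognized, "1587"))
print(last4_matches(recognized, "9999"))
```

Keeping the comparison separate from the OCR call means the matching rule can be tested and adjusted without running the model.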

The team customized PaddleOCR’s pretrained model with real receipt data to increase its accuracy. Once it reached acceptable performance, the refined model was deployed to production.

Just because a model is in production doesn’t mean the job is done. In fact, no model works well forever, however carefully and frequently we clean the data. A common cause of model degradation is data drift, which occurs when the distribution of the input data changes and the model’s accuracy drops as a result. It is therefore necessary to monitor the model in production continuously, so that it can be retrained with new data or its logic reviewed when needed.
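
One simple way to watch for drift is to compare the distribution of some per-receipt signal between a baseline period and a recent period. The sketch below uses the Population Stability Index (PSI) for that comparison. PSI is our choice for illustration only; the monitored signal (here, a hypothetical OCR confidence score per receipt), the number of bins, and the alert threshold are all assumptions that would need tuning on real data.

```python
import math


def psi(baseline, recent, bins=10, eps=1e-6):
    """Population Stability Index between two samples of a numeric signal.

    Bin edges are taken from the baseline range; recent values outside
    that range are clamped into the first or last bin.
    """
    if not baseline or not recent:
        raise ValueError("both samples must be non-empty")

    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins
    if width == 0:
        width = 1.0  # degenerate baseline: everything falls in one bin

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # eps avoids log(0) and division by zero for empty bins
        return [max(c / len(values), eps) for c in counts]

    p = proportions(baseline)
    q = proportions(recent)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


# Hypothetical confidence scores: spread out at launch, bunched up recently.
baseline_scores = [i / 100 for i in range(100)]
recent_scores = [0.9 + i / 1000 for i in range(100)]

score = psi(baseline_scores, recent_scores)
# A commonly cited rule of thumb treats PSI above 0.25 as a significant shift.
print("drift suspected" if score > 0.25 else "distribution looks stable")
```

A high PSI does not say why the inputs changed, only that they did; it is a prompt to inspect recent receipts and decide whether retraining is needed.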
