Tarana Murtuzova for API4AI

Posted on Jul 15

Efficient Driver's License Recognition with OCR API: Step-by-Step Tutorial

#ocr #python #api4ai #opencv

Introduction

Optical Character Recognition (OCR) technology has transformed the way we convert various document types—such as scanned paper documents, PDFs, or digital camera images—into editable and searchable data. OCR is vital for automating data entry, enhancing accuracy, and saving time by removing the need for manual data extraction. Its applications are widespread across industries like banking, healthcare, logistics, and government services, making it an indispensable tool in the digital transformation era.

In this guide, we will concentrate on a specific OCR use case: recognizing and extracting information from driver's licenses. This capability is essential for businesses and organizations that need to verify identities, such as car rental companies, financial institutions, and security agencies. Automating this process with OCR can greatly improve operational efficiency, reduce human error, and streamline customer interactions.

For this tutorial, we will utilize the API4AI OCR API, a powerful and adaptable solution known for its high accuracy and performance in general OCR tasks. API4AI was selected for its user-friendliness, extensive documentation, and cost-effectiveness. It offers a flexible API that can be integrated into various applications to perform OCR on different document types, including driver's licenses. However, you are welcome to use any other tools, using this guide as a reference.

One of the primary reasons for choosing a general OCR API like API4AI, instead of specialized solutions designed exclusively for driver's license recognition, is cost-efficiency. Specialized solutions often come with higher costs and less flexibility, which can be a significant burden, particularly for small to medium-sized businesses. By using a general OCR API, you can achieve similar results at a lower cost while maintaining the flexibility to adapt the solution for other OCR needs as well.

In the following sections, we will walk you through setting up your environment, integrating the API4AI OCR API, and writing the necessary code to recognize and extract information from driver's licenses. Whether you're a developer looking to add OCR capabilities to your application or a business owner aiming to automate identity verification, this step-by-step tutorial will equip you with the knowledge and tools to get started.

Understanding OCR and Its Applications

Defining Optical Character Recognition (OCR)

Optical Character Recognition (OCR) is a technology designed to convert various document formats—such as scanned paper documents, PDFs, or images—into editable and searchable text data. OCR algorithms examine the visual patterns of characters within these documents and convert them into machine-readable text, enabling computers to comprehend and process the content. OCR has become a crucial tool in the digitization of information, facilitating automation and optimizing workflows across multiple industries.

Common Uses of OCR Technology

OCR technology is utilized in various industries and scenarios, including:

Document Digitization: Transforming physical documents into digital formats for easier storage, retrieval, and distribution.
Data Entry Automation: Streamlining data entry by extracting text from documents and inputting it into databases or other systems.
Text Recognition in Images: Identifying text within images taken by digital cameras or smartphones, such as signs, labels, or handwritten notes.
Translation Services: Facilitating the translation of printed or handwritten text from one language to another.
Accessibility: Making printed materials accessible to visually impaired individuals by converting them into text-to-speech or braille formats.

Specific Applications in Driver's License Recognition

Driver's license recognition is a specialized application of OCR technology that involves extracting key information from driver's licenses, such as the holder's name, license number, date of birth, and address. This data is often required for identity verification in various sectors, including:

Car Rental Services: Confirming the identity of customers before renting vehicles to ensure compliance with age restrictions and driving eligibility.
Financial Institutions: Verifying customer identities for account openings, loan applications, or financial transactions.
Government Agencies: Efficiently processing driver's license renewals, registrations, and other administrative tasks.
Security and Access Control: Providing access to restricted areas or sensitive information based on verified identities.

The Importance of Selecting the Right OCR API for the Task

For tasks like driver's license recognition and other OCR applications, selecting the right OCR API is essential to ensure precise and dependable results. Key considerations when choosing an OCR API include:

Accuracy: The OCR engine's capability to accurately recognize text, even under challenging conditions such as low-quality images or distorted text.
Speed: The OCR API's processing speed, which is critical when handling large volumes of documents or real-time applications.
Ease of Integration: The simplicity and flexibility of incorporating the OCR API into existing applications or workflows.
Language Support: The ability to support multiple languages and character sets, particularly for applications in multilingual environments.
Cost: The pricing structure of the OCR API, including any usage-based fees or subscription plans, and its affordability for the intended purpose.

By thoroughly assessing these factors and selecting a dependable OCR API like API4AI, you can ensure the success of your driver's license recognition project, enhancing efficiency, accuracy, and cost-effectiveness.

Why Opt for General OCR APIs Over Specialized Solutions for Driver's License Recognition?

Overview of Specialized Solutions for Driver's License Recognition

Specialized solutions for driver's license recognition are engineered specifically to extract and verify information from driver's licenses. They often come with pre-configured templates and algorithms optimized for various license formats, making them appear convenient for businesses needing high accuracy and rapid deployment. These solutions generally include features such as automatic format detection, advanced data extraction, and integration with identity verification services.

Examining the High Costs of Specialized Solutions

While specialized solutions provide convenience and high accuracy, they come with considerable drawbacks, mainly in terms of expense. These solutions often entail:

High Licensing Fees: Specialized software usually comes with steep upfront licensing costs or subscription fees, which can be prohibitively expensive for small to medium-sized businesses.
Per-Transaction Costs: Many specialized solutions charge based on the number of transactions or scans, causing costs to escalate as the volume of processed licenses increases.
Maintenance and Support Fees: Ongoing expenses for software maintenance, updates, and support can accumulate, further raising the total cost of ownership.
Vendor Lock-In: Businesses might become dependent on a single vendor, limiting their flexibility to switch to alternative solutions without incurring additional costs or experiencing significant disruptions.

Advantages of Utilizing General OCR APIs for Driver's License Recognition

Opting for a general OCR API, such as API4AI, for driver's license recognition offers numerous benefits compared to specialized solutions:

Cost-Effectiveness: General OCR APIs usually feature lower upfront costs and more adaptable pricing models, including pay-as-you-go options. This makes them more budget-friendly, particularly for businesses with fluctuating processing volumes.
Flexibility and Customization: General OCR APIs allow for significant adaptability and customization of the OCR process to meet specific requirements. Developers can fine-tune the data extraction process, implement custom validation rules, and integrate with other systems without being restricted by the constraints of a specialized solution.
Scalability: General OCR APIs are built to handle a diverse range of document types and can easily scale with the growth of a business. As the volume of processed licenses increases, the solution can be expanded without major changes to the underlying infrastructure.

By harnessing the capabilities of general OCR APIs, organizations can reduce costs, improve efficiency, and retain the ability to adjust their solutions as their requirements change. This makes general OCR solutions a practical choice for driver's license recognition tasks.

Coding for Driver's License Recognition with API4AI OCR

Assumptions

In this guide, we will delve into using the API4AI OCR API to extract essential information from a driver’s license. By leveraging OCR technology, we can automate data extraction, enhancing efficiency and minimizing the risk of human error. To keep this tutorial focused and manageable, we will use a sample driver’s license from Washington, D.C., and concentrate on extracting the ID and the name of the license holder. This approach will help us demonstrate the process clearly and effectively. However, the principles and techniques discussed can be applied to driver’s licenses from any US state. By the end of this guide, you should have a solid grasp of how to integrate and use the API4AI OCR API for driver's license recognition in your projects.

For this demonstration, we will utilize the demo API endpoint provided by API4AI, which allows a limited number of queries. This will be sufficient for our experimental purposes, enabling us to showcase the OCR API’s capabilities without incurring any costs. For a full-featured solution in a production environment, please refer to the API4AI documentation for detailed instructions on obtaining an API key and exploring the full range of available features.

For testing and development, we will use the image below.

Understanding the API4AI OCR API

The API4AI OCR API operates in two modes: "simple_text" (default) and "simple_words". The "simple_text" mode generates text with recognized phrases separated by line breaks and their positions. This mode isn't our focus at the moment because we need to determine the location of each word to have a reliable fallback, so we will use "simple_words", which the code below selects with the algo=simple-words query parameter. Before diving in, it's essential to understand how the API functions. As the saying goes, a single code example is worth more than a thousand words.



import math
import sys

import cv2
import requests

API_URL = 'https://demo.api4ai.cloud/ocr/v1/results?algo=simple-words'


# get path from the 1st argument
image_path = sys.argv[1]

# we use the HTTP API to get recognized words from the specified image
with open(image_path, 'rb') as f:
    response = requests.post(API_URL, files={'image': f})
json_obj = response.json()

for elem in json_obj['results'][0]['entities'][0]['objects']:
    box = elem['box']  # normalized x, y, width, height
    text = elem['entities'][0]['text']  # recognized text
    print(  # show every word with bounding box
        f'[{box[0]:.4f}, {box[1]:.4f}, {box[2]:.4f}, {box[3]:.4f}], {text}'
    )

In this brief code snippet, we interact with the API by sending an image via a POST request, with the image path provided as the first command-line argument. This script will display the normalized values of the top-left coordinate, the width, and the height of the area containing the recognized word, along with the word itself. Below is an output fragment for the provided image:



...
[0.6279, 0.6925, 0.0206, 0.0200], All
[0.6529, 0.6800, 0.1118, 0.0300], 02/21/1984
[0.6162, 0.7175, 0.0309, 0.0200], BEURT
[0.6515, 0.7350, 0.0441, 0.0175], 4a.ISS
[0.6515, 0.7675, 0.1132, 0.0250], 02/17/2010
[0.7662, 0.1725, 0.0647, 0.1125], tomand
[0.6529, 0.8550, 0.0324, 0.0275], ♥♥
[0.6941, 0.8550, 0.0809, 0.0275], DONOR
[0.6529, 0.8950, 0.1074, 0.0300], VETERAN
[0.9000, 0.0125, 0.0691, 0.0375], USA
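The nested lookups in the loop above can be wrapped in a small helper that tolerates missing keys. The following is a minimal sketch, assuming the response shape implied by the tutorial's own code (results, then entities, then objects, each with a box and a nested text entity). The helper name parse_words and the sample payload are illustrative: the payload is hand-built to mimic that shape and is not a captured API response.

```python
def parse_words(json_obj):
    """Flatten an OCR response into a list of {'box', 'text'} dicts.

    Assumes the structure used in the tutorial code:
    json_obj['results'][0]['entities'][0]['objects'].
    """
    words = []
    results = json_obj.get('results') or []
    if not results:
        return words
    entities = results[0].get('entities') or []
    if not entities:
        return words
    for obj in entities[0].get('objects') or []:
        inner = obj.get('entities') or []
        if 'box' not in obj or not inner:
            continue  # skip anything that lacks a box or recognized text
        words.append({'box': obj['box'], 'text': inner[0].get('text', '')})
    return words


# Hand-built sample that mimics the assumed structure (not real API output).
sample = {
    'results': [{
        'entities': [{
            'objects': [
                {'box': [0.6941, 0.8550, 0.0809, 0.0275],
                 'entities': [{'text': 'DONOR'}]},
                {'box': [0.9000, 0.0125, 0.0691, 0.0375],
                 'entities': [{'text': 'USA'}]},
            ]
        }]
    }]
}

for word in parse_words(sample):
    print(word['text'], word['box'])
```

Returning an empty list instead of raising keeps the calling code simple when the service returns no recognized words, at the cost of hiding malformed responses, so you may prefer to log or raise in a real application.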

Let's use the extracted data to draw bounding boxes on an image with OpenCV. To accomplish this, we need to convert the normalized values into absolute values represented in integer pixels. We require the exact coordinates of the upper left and lower right corners to draw the bounding box accurately. To do this, let's create a function called get_corner_coords.



def get_corner_coords(height, width, box):
    x1 = int(box[0] * width)
    y1 = int(box[1] * height)
    obj_width = box[2] * width
    obj_height = box[3] * height
    x2 = int(x1 + obj_width)
    y2 = int(y1 + obj_height)
    return x1, y1, x2, y2
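To make the conversion concrete, here is a small worked example. It repeats the same arithmetic so that it runs on its own and applies it to the "DONOR" entry from the output fragment above. The image size of 1000 by 500 pixels is a hypothetical value chosen for easy arithmetic, since the dimensions of the sample image are not stated.

```python
def get_corner_coords(height, width, box):
    # Same logic as the tutorial function, repeated so the example runs alone.
    x1 = int(box[0] * width)
    y1 = int(box[1] * height)
    x2 = int(x1 + box[2] * width)
    y2 = int(y1 + box[3] * height)
    return x1, y1, x2, y2


# Hypothetical image size; the real sample image's dimensions are not given.
HEIGHT, WIDTH = 500, 1000
box = [0.6941, 0.8550, 0.0809, 0.0275]  # the 'DONOR' entry from the output

corners = get_corner_coords(HEIGHT, WIDTH, box)
print(corners)  # (694, 427, 774, 440)
```

Note that int() truncates toward zero, so the corners can be up to one pixel smaller than the exact values. That is harmless for drawing frames.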

The function to draw the bounding box will be straightforward:



def draw_bounding_box(image, box):
    x1, y1, x2, y2 = get_corner_coords(image.shape[0], image.shape[1], box)
    cv2.rectangle(image, (x1 - 2, y1 - 2), (x2 + 2, y2 + 2), (127, 0, 0), 2)

In this function, we enlarge the frame by two pixels on each side to ensure it isn't too close to the words. The color (127, 0, 0) represents navy blue in BGR format, and the frame's thickness is set to two pixels.

Naturally, to manipulate an image, it must first be read. Let's update the final part of our script: read the image, remove the debug output containing frame information, draw each bounding box on the read image, and save the modified image as "output.png".



image = cv2.imread(image_path)
for elem in json_obj['results'][0]['entities'][0]['objects']:
    box = elem['box']  # normalized x, y, width, height
    text = elem['entities'][0]['text']  # recognized text
    draw_bounding_box(image, box)  # add boundaries to image
cv2.imwrite('output.png', image)

Running the script now saves output.png with a frame drawn around every recognized word.

Extracting the License Holder's ID and Name

Previously, we successfully used the API to extract text information from a driver's license image. That's a great start! But how do we specifically retrieve the ID number and the name?

Here are the elements within the area of interest:



[0.3059, 0.1975, 0.0500, 0.0175], 4d.DLN
[0.3059, 0.2325, 0.1059, 0.0275], A9999999
[0.3074, 0.2800, 0.0603, 0.0200], 1.FAMILY
[0.3735, 0.2800, 0.0412, 0.0175], NAME
[0.3059, 0.3150, 0.0794, 0.0300], JONES
[0.3059, 0.3675, 0.0574, 0.0225], 2.GIVEN
[0.3691, 0.3675, 0.0529, 0.0225], NAMES
[0.3074, 0.4025, 0.1191, 0.0275], ANGELINA
[0.3074, 0.4375, 0.1191, 0.0300], GABRIELA

Although the POST request returned ordered results here, the order may vary, so we can't depend on it. It's safer to assume that the recognized elements arrive in no particular order.
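If you want a deterministic order for debugging or logging, you can sort the words yourself. The sketch below is one possible approach, not part of the API: it groups words into horizontal bands and sorts each band left to right. The function name sort_reading_order and the row_tol tolerance are assumptions made for illustration.

```python
def sort_reading_order(words, row_tol=0.02):
    """Sort word dicts top-to-bottom, then left-to-right.

    row_tol is a hypothetical tolerance in normalized units: words whose
    top edges round to the same row_tol-sized band are treated as one row.
    """
    def key(word):
        x, y = word['box'][0], word['box'][1]
        return (round(y / row_tol), x)
    return sorted(words, key=key)


# A few entries from the area of interest, deliberately shuffled.
shuffled = [
    {'box': [0.3735, 0.2800, 0.0412, 0.0175], 'text': 'NAME'},
    {'box': [0.3059, 0.2325, 0.1059, 0.0275], 'text': 'A9999999'},
    {'box': [0.3074, 0.2800, 0.0603, 0.0200], 'text': '1.FAMILY'},
    {'box': [0.3059, 0.1975, 0.0500, 0.0175], 'text': '4d.DLN'},
]

ordered = [w['text'] for w in sort_reading_order(shuffled)]
print(ordered)  # ['4d.DLN', 'A9999999', '1.FAMILY', 'NAME']
```

Banding is a rough heuristic: two words on the same printed line can land in different bands if their top edges straddle a band boundary, so treat the result as a convenience rather than a guarantee.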

Let's create a list named words to easily search for words and their positions:



words = []
for elem in json_obj['results'][0]['entities'][0]['objects']:
    box = elem['box']
    text = elem['entities'][0]['text']
    words.append({'box': box, 'text': text})

Let's refer to "4d.DLN," "1.FAMILY," and "2.GIVEN" as the field names, and the text below them in the image as the field values. The simplest method is to search for the closest elements situated below the field names. We might encounter words far to the right or left, so we should evaluate the distance between the text elements instead of their positions relative to the axes. Let's write some code for this.

First, let's identify the positions of the field names:



ID_MARK = '4d.DLN'
FAMILY_MARK = '1.FAMILY'
NAME_MARK = '2.GIVEN'

id_mark_info = {}
fam_mark_info = {}
name_mark_info = {}

for elem in words:
    if elem['text'] == ID_MARK:
        id_mark_info = elem
    elif elem['text'] == FAMILY_MARK:
        fam_mark_info = elem
    elif elem['text'] == NAME_MARK:
        name_mark_info = elem
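One caveat with the loop above: if OCR misreads a field name, the corresponding dictionary stays empty, and later code that reads its 'box' key will fail with a KeyError. The following sketch shows one way to detect that early. The helper name find_marks is hypothetical, and the sample list deliberately omits "2.GIVEN" to show the missing case.

```python
def find_marks(words, marks):
    """Return ({mark: word_dict}, [missing marks]) for the given mark texts."""
    found = {}
    for elem in words:
        text = elem['text'].strip()
        if text in marks and text not in found:
            found[text] = elem  # keep the first occurrence of each mark
    missing = [m for m in marks if m not in found]
    return found, missing


MARKS = ['4d.DLN', '1.FAMILY', '2.GIVEN']

# Sample words; '2.GIVEN' is left out on purpose.
sample_words = [
    {'box': [0.3059, 0.1975, 0.0500, 0.0175], 'text': '4d.DLN'},
    {'box': [0.3059, 0.2325, 0.1059, 0.0275], 'text': 'A9999999'},
    {'box': [0.3074, 0.2800, 0.0603, 0.0200], 'text': '1.FAMILY'},
]

found, missing = find_marks(sample_words, MARKS)
if missing:
    print('Field names not found:', ', '.join(missing))
```

With the missing list in hand, a program can report a clear error or ask for a better photo instead of crashing further down.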

Next, we will write a function to locate the closest elements positioned below a given reference element:



def find_label_below(word_info):
    x = word_info['box'][0]
    y = word_info['box'][1]
    candidate = words[0]
    candidate_dist = math.inf
    for elem in words:
        if elem['text'] == word_info['text']:
            continue
        curr_box_x = elem['box'][0]
        curr_box_y = elem['box'][1]
        curr_vert_dist = curr_box_y - y
        curr_horiz_dist = x - curr_box_x
        if curr_vert_dist > 0:  # we are only looking for items below
            dist = math.hypot(curr_vert_dist, curr_horiz_dist)
            if dist > candidate_dist:
                continue
            candidate_dist = dist
            candidate = elem
    return candidate

Let's apply this function and draw the boundaries around the identified elements:



id_info = find_label_below(id_mark_info)
fam_info = find_label_below(fam_mark_info)
name_info = find_label_below(name_mark_info)
name2_info = find_label_below(name_info)
canvas = image.copy()
draw_bounding_box(canvas, id_info['box'])
draw_bounding_box(canvas, fam_info['box'])
draw_bounding_box(canvas, name_info['box'])
draw_bounding_box(canvas, name2_info['box'])
cv2.imwrite('result.png', canvas)

Let's review what we have achieved so far:

It appears that we have successfully extracted the necessary fields! 😊
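The search function above falls back to the first word in the list when nothing lies below the reference element, which can silently return an unrelated word. Below is a hedged variant that returns None in that case and skips the reference element by identity rather than by text. The name find_nearest_below is an assumption for this sketch, and the sample data is the list of elements from the area of interest shown earlier.

```python
import math


def find_nearest_below(words, word_info):
    """Return the closest word whose top edge is below word_info, or None."""
    x, y = word_info['box'][0], word_info['box'][1]
    best, best_dist = None, math.inf
    for elem in words:
        if elem is word_info:
            continue  # skip the reference element itself
        dx = elem['box'][0] - x
        dy = elem['box'][1] - y
        if dy <= 0:
            continue  # we are only looking for items strictly below
        dist = math.hypot(dx, dy)
        if dist < best_dist:
            best, best_dist = elem, dist
    return best


area = [
    {'box': [0.3059, 0.1975, 0.0500, 0.0175], 'text': '4d.DLN'},
    {'box': [0.3059, 0.2325, 0.1059, 0.0275], 'text': 'A9999999'},
    {'box': [0.3074, 0.2800, 0.0603, 0.0200], 'text': '1.FAMILY'},
    {'box': [0.3735, 0.2800, 0.0412, 0.0175], 'text': 'NAME'},
    {'box': [0.3059, 0.3150, 0.0794, 0.0300], 'text': 'JONES'},
    {'box': [0.3059, 0.3675, 0.0574, 0.0225], 'text': '2.GIVEN'},
    {'box': [0.3691, 0.3675, 0.0529, 0.0225], 'text': 'NAMES'},
    {'box': [0.3074, 0.4025, 0.1191, 0.0275], 'text': 'ANGELINA'},
    {'box': [0.3074, 0.4375, 0.1191, 0.0300], 'text': 'GABRIELA'},
]
by_text = {w['text']: w for w in area}

print(find_nearest_below(area, by_text['4d.DLN'])['text'])    # A9999999
print(find_nearest_below(area, by_text['1.FAMILY'])['text'])  # JONES
print(find_nearest_below(area, by_text['GABRIELA']))          # None
```

Callers then need to handle None explicitly, which is usually preferable to processing a wrong value.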

Finalizing the Results

Given all we've covered, let's develop a practical program that doesn't rely on OpenCV. This program will accept the image path as an argument and display the ID number and full name in the terminal.



#!/usr/bin/env python3

import math
import sys

import requests

API_URL = 'https://demo.api4ai.cloud/ocr/v1/results?algo=simple-words'

ID_MARK = '4d.DLN'
FAMILY_MARK = '1.FAMILY'
NAME_MARK = '2.GIVEN'
ADDRESS_MARK = '8.ADDRESS'


def find_text_below(words, word_info):
    x = word_info['box'][0]
    y = word_info['box'][1]
    candidate = words[0]
    candidate_dist = math.inf
    for elem in words:
        if elem['text'] == word_info['text']:
            continue
        curr_box_x = elem['box'][0]
        curr_box_y = elem['box'][1]
        curr_vert_dist = curr_box_y - y
        curr_horiz_dist = x - curr_box_x
        if curr_vert_dist > 0:  # we are only looking for items below
            dist = math.hypot(curr_vert_dist, curr_horiz_dist)
            if dist > candidate_dist:
                continue
            candidate_dist = dist
            candidate = elem
    return candidate


if __name__ == '__main__':
    if len(sys.argv) != 2:
        print('Expected one argument: path to image.')
        sys.exit(1)
    image_path = sys.argv[1]
    with open(image_path, 'rb') as f:
        response = requests.post(API_URL, files={'image': f})
    json_obj = response.json()
    words = []
    for elem in json_obj['results'][0]['entities'][0]['objects']:
        box = elem['box']
        text = elem['entities'][0]['text']
        words.append({'box': box, 'text': text})

    id_mark_info = {}
    fam_mark_info = {}
    name_mark_info = {}

    for elem in words:
        if elem['text'] == ID_MARK:
            id_mark_info = elem
        elif elem['text'] == FAMILY_MARK:
            fam_mark_info = elem
        elif elem['text'] == NAME_MARK:
            name_mark_info = elem

    license = find_text_below(words, id_mark_info)['text']
    family_name = find_text_below(words, fam_mark_info)['text']
    name1_info = find_text_below(words, name_mark_info)
    name1 = name1_info['text']
    name2 = find_text_below(words, name1_info)['text']

    if name2 == ADDRESS_MARK:  # no second name
        full_name = f'{name1} {family_name}'
    else:  # with second name
        full_name = f'{name1} {name2} {family_name}'

    print(f'Driver license: {license}')
    print(f'Full name:      {full_name}')

The program's output for the image provided at the beginning, given as the first argument:



Driver license: A9999999
Full name:      ANGELINA GABRIELA JONES
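The program decides whether a second given name exists by comparing against "8.ADDRESS" only. A slightly more general check is to test whether the candidate looks like any field label. The sketch below assumes labels follow the shape seen in this tutorial (digits, an optional lowercase letter, a dot, then an uppercase letter, as in "4d.DLN" or "8.ADDRESS"). That pattern is inferred from the sample license and may not hold for other layouts, and both function names are hypothetical.

```python
import re

# Assumed label shape, inferred from the labels seen in this tutorial
# ('4d.DLN', '1.FAMILY', '2.GIVEN', '8.ADDRESS', '4a.ISS').
FIELD_LABEL_RE = re.compile(r'^\d+[a-z]?\.[A-Z]')


def looks_like_field_label(text):
    """Return True if text resembles a numbered field label."""
    return bool(FIELD_LABEL_RE.match(text))


def build_full_name(name1, name2, family_name):
    """Join the name parts, dropping name2 when it is really a field label."""
    if looks_like_field_label(name2):
        return f'{name1} {family_name}'
    return f'{name1} {name2} {family_name}'


print(build_full_name('ANGELINA', 'GABRIELA', 'JONES'))   # ANGELINA GABRIELA JONES
print(build_full_name('ANGELINA', '8.ADDRESS', 'JONES'))  # ANGELINA JONES
```

The same predicate can serve as a basic sanity check on any extracted value: if the "license number" looks like a field label, the lookup most likely picked up the wrong element.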

This program can be easily modified to extract additional data from driver's licenses. While we didn't address all potential issues, as the primary goal was to demonstrate the practical application of the API, there is plenty of room for the reader to make improvements. For instance, to handle rotated images, you could calculate the rotation angle from the key fields and use that information to locate the "underlying" elements with the field values. Give it a try! Using these general concepts, you can implement logic for other types of documents and text-containing images.
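As a starting point for the rotation idea, here is a rough sketch. It assumes that two field names, such as "4d.DLN" and "1.FAMILY", sit directly one above the other on an upright license (their x values are nearly equal in the sample output), so the angle of the line between them estimates the rotation. Work in pixel coordinates here, because normalized coordinates distort angles when the image is not square. All names and the sample positions are hypothetical.

```python
import math


def estimate_rotation(upper_xy, lower_xy):
    """Angle in radians by which the upper-to-lower vector deviates from straight down."""
    dx = lower_xy[0] - upper_xy[0]
    dy = lower_xy[1] - upper_xy[1]
    return math.atan2(dx, dy)


def deskew_point(point, origin, theta):
    """Rotate point around origin so the reference vector points straight down again."""
    x = point[0] - origin[0]
    y = point[1] - origin[1]
    x_new = x * math.cos(theta) - y * math.sin(theta)
    y_new = x * math.sin(theta) + y * math.cos(theta)
    return origin[0] + x_new, origin[1] + y_new


# Hypothetical pixel positions of two labels on a rotated photo.
upper = (100.0, 100.0)
lower = (130.0, 140.0)

theta = estimate_rotation(upper, lower)
fixed = deskew_point(lower, upper, theta)
print(round(math.degrees(theta), 2))
print(round(fixed[0], 6), round(fixed[1], 6))
```

After applying deskew_point to the top-left corner of every word, the "closest element below" search can run on the corrected positions. This ignores perspective distortion, which would need a different approach.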

To learn more, refer to the OCR API documentation and explore code examples written in various programming languages.

Conclusion

In this tutorial, we've guided you through the step-by-step process of using the [API4AI OCR API](https://api4.ai/apis/ocr) to recognize and extract information from a US driver’s license. We started by understanding the basics of OCR technology and its diverse applications. We then discussed the advantages of using a general OCR API over specialized solutions, emphasizing cost-effectiveness, flexibility, and scalability.

Throughout the tutorial, we wrote code to send an image to the API, parse the OCR results, and extract the ID number and name from the license by locating the text beneath each field name. We also discussed ways to extend the program to retrieve additional information.

Using OCR for driver's license recognition provides numerous benefits. It automates data extraction, reducing manual effort and minimizing errors, which can significantly enhance operational efficiency in industries like car rentals, financial institutions, and government agencies. Moreover, the adaptability of general OCR APIs allows for customization and application to various document types and use cases.

We encourage you to explore further applications of OCR technology beyond driver's license recognition. OCR can be applied to a wide range of documents and scenarios, from digitizing printed texts to automating form processing and enhancing accessibility. By leveraging OCR, you can streamline workflows, improve accuracy, and unlock new opportunities for innovation in your projects.

Thank you for following along with this tutorial. We hope you found it informative and useful. For more details and advanced usage, be sure to check out the OCR API documentation and explore additional examples in various programming languages. Happy coding!

Additional Resources

API4AI OCR API Documentation Links

To explore the features and capabilities of the API4AI OCR API in greater detail, refer to the official documentation. It offers comprehensive guides, code examples, and detailed explanations of the API endpoints, helping you make the most of OCR in your applications.

Links to Related Tutorials and Courses

Boost your practical skills and gain hands-on experience with these tutorials and courses focused on OCR and image processing:

Tutorials:

OpenCV-Python Tutorials - Official OpenCV tutorials for Python.
Real Python: OCR with Tesseract and OpenCV - A practical guide to using Tesseract and OpenCV in Python.

Online Courses:

Coursera: Introduction to Computer Vision and Image Processing - A comprehensive course on computer vision and OpenCV.
Udacity: Computer Vision Nanodegree Program - An in-depth program covering various aspects of computer vision.
edX: Computer Vision and Image Processing Fundamentals - A foundational course on computer vision principles and applications.
CS231n: Deep Learning for Computer Vision - A detailed course focusing on deep learning architectures, especially for tasks like image classification.

By delving into these supplementary resources, you can broaden your knowledge of OCR technology, sharpen your skills, and uncover innovative methods to apply OCR in your projects. Enjoy your learning journey!
