DEV Community

abbazs

Reconstructing Shopping Bill Dataset from OCR Data using Bash

Introduction

In this tutorial, we'll explore a Bash script that reconstructs a Shopping Bill dataset from OCR (Optical Character Recognition) data files. This is a common problem in data processing where we need to merge and clean data from multiple sources to create a coherent dataset. We'll dive into the problem, the solution, and explain the various Bash tools and techniques used.

Problem Statement

We have a set of OCR data files containing information about shopping bills. These files include details such as shop names, customer names, items purchased, categories, quantities, prices, and costs. Our task is to reconstruct this data into individual shopping bill files, each containing the complete information for a single transaction.

The main challenges are:

  1. Extracting relevant information from multiple files
  2. Dealing with potential OCR errors
  3. Formatting the data correctly
  4. Generating individual files for each shopping bill

Inputs are available at IITM Google Drive

The Solution

Let's break down the solution into its main components and explain the tools and techniques used.

1. The extract_data Function

extract_data() {
    local seq=$1
    local file=$2
    local ps=".*${seq}"
    local pe=".*$((seq + 1))"

    sed -n "
    /^\[File: ${ps}\.jpg\]/{
        n;n;
        :a;
        /^\[File: ${pe}\.jpg\]/q;
        p;n;
        ba;
    }
" "$file"
}

This function uses sed to extract data for a specific record from a file. Here's how it works:

  • It takes two parameters: the sequence number of the record and the file to search.
  • It uses sed with the -n option, which suppresses automatic printing of pattern space.
  • The sed script searches for the header line that starts the desired record, of the form [File: ...<seq>.jpg].
  • It steps past that header and the line after it (n;n), then loops, printing each line until it reaches the header of the next record, where it quits.

Key sed commands used:

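  • /pattern/: Runs the following command or block only on lines that match the pattern
  • n: Replaces the pattern space with the next input line (nothing is printed here because -n is set)
  • :label: Defines a label
  • p: Prints the current pattern space
  • q: Quits sed without reading further input
  • ba: Branches (jumps) to the label a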
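
To see these commands working together, here is a minimal sketch that runs extract_data against a tiny sample file. The sample layout (a [File: ...] header, one blank line, then the data lines) and the bill_N.jpg names are assumptions made for illustration; the real OCR files may be laid out differently.

```shell
#!/usr/bin/env bash
# Same function as in the article, repeated so the sketch is self-contained.
extract_data() {
    local seq=$1
    local file=$2
    local ps=".*${seq}"
    local pe=".*$((seq + 1))"

    sed -n "
    /^\[File: ${ps}\.jpg\]/{
        n;n;
        :a;
        /^\[File: ${pe}\.jpg\]/q;
        p;n;
        ba;
    }
" "$file"
}

# Hypothetical sample input (assumed layout, not taken from the real data).
sample=$(mktemp)
cat > "$sample" <<'EOF'
[File: bill_1.jpg]

Milk
Bread
[File: bill_2.jpg]

Soap
EOF

extract_data 1 "$sample"   # prints the two data lines of record 1
extract_data 2 "$sample"   # prints the single data line of record 2
```

Note that the start pattern .*1 would also match a header ending in 11.jpg or 21.jpg. The function still returns the right record when headers appear in ascending order, because sed acts on the first matching header and quits at the next one.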

2. The clean_empty_lines Function

clean_empty_lines() {
    # Quote the filename so paths containing spaces survive word splitting.
    sed '/^\s*$/d' "$1"
}

This simple function uses sed to remove empty and whitespace-only lines from a file:

  • /^\s*$/d: Matches lines that contain only whitespace and deletes them.
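
A quick way to check the behaviour is to feed the function a file that mixes blank, space-only, and tab-only lines. The sketch below also shows a variant using the POSIX class [[:space:]], since \s is a GNU sed extension; the name clean_empty_lines_portable is hypothetical and not part of the original script.

```shell
#!/usr/bin/env bash
# Function from the article (with the argument quoted).
clean_empty_lines() {
    sed '/^\s*$/d' "$1"
}

# Hypothetical portable variant: [[:space:]] instead of the GNU-only \s.
clean_empty_lines_portable() {
    sed '/^[[:space:]]*$/d' "$1"
}

# Sample file containing an empty line, a space-only line, and a tab-only line.
sample=$(mktemp)
printf 'Rice\n\n   \nSugar\n\t\nSalt\n' > "$sample"

clean_empty_lines "$sample"            # prints Rice, Sugar, Salt on separate lines
clean_empty_lines_portable "$sample"   # same output
```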

3. The Main script Function

script() {
    local input_path=$1
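    # Highest sequence number: the digits on the last line of ocr_seqno.txt.
    local max=$(sed -n '$s/[^0-9]*//gp' "$input_path/ocr_seqno.txt")
    # Customer names: drop [File: ...] headers and blank lines, keep the last word of each line.
    local names=($(sed -e 's/^\[.*\]//g' -e '/^\s*$/d' -e 's/.*\s//g' "$input_path/ocr_names.txt"))
    # Store names: one array element per non-empty line.
    mapfile -t stores < <(sed -e 's/^\[.*\]//g' -e '/^\s*$/d' "$input_path/ocr_shopname.txt")
    # Items: collect the lines between "Item" markers in the hold space and join them with "; ".
    mapfile -t items < <(sed -e '/^\[File:/d' -e '/^Item$/!{H;d}' -e '/^Item$/{x;s/^\n//;s/\n/; /g;s/;[[:space:]]*;/;/g;s/^[[:space:]]*//;s/^;//;s/[[:space:]]*$//;s/^[[:space:]]*//;/./p;s/.*//;h}' -e '$ {x;s/^\n//;s/\n/; /g;s/;[[:space:]]*;/;/g;s/^[[:space:]]*//;s/^;//;s/[[:space:]]*$//;s/^[[:space:]]*//;/^[[:space:]]*$/d;/./p}' -e '/^[[:space:]]*$/d' "$input_path/ocr_items.txt")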
    local categories=()

    # Data cleanup
    names[9]=Rajesh
    names[22]=Julia

    # Print the formatted data
    for ((i = 0; i < $max; i++)); do
        printf "%s:%s:%d\n" "${stores[$i]}" "${names[$i]}" $((i + 1))
    done

    echo ${#items[@]}
    for ((i = 0; i < ${#items[@]}; i++)); do
        printf "%02d:%s\n" $i "${items[$i]}"
    done
}

This function is the heart of our script. Let's break it down:

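  1. It uses sed to read the highest sequence number from the last line of ocr_seqno.txt.
  2. It processes ocr_names.txt and ocr_shopname.txt to extract customer names and store names.
  3. It uses a complex sed command to process ocr_items.txt and join the lines between "Item" markers into one entry each.
  4. It overwrites two names that OCR misread (names[9] and names[22]) with hand-corrected values.
  5. Finally, it prints one store:name:sequence line per bill, followed by the item count and the numbered item list.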
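
The first two steps and the final print loop can be tried in isolation on miniature inputs. The sketch below is an illustration under stated assumptions: the file contents are invented, the function name print_bill_headers is hypothetical, and names are read with mapfile rather than the unquoted array assignment used in the original.

```shell
#!/usr/bin/env bash
# Hypothetical miniature inputs; the real OCR files may differ in layout.
dir=$(mktemp -d)
printf '[File: 1.jpg]\nSeqNo 1\n[File: 2.jpg]\nSeqNo 2\n' > "$dir/ocr_seqno.txt"
printf '[File: 1.jpg]\nName: Asha\n\n[File: 2.jpg]\nName: Ravi\n' > "$dir/ocr_names.txt"
printf '[File: 1.jpg]\nSun General\n\n[File: 2.jpg]\nSV Stores\n' > "$dir/ocr_shopname.txt"

# Hypothetical helper: prints one store:name:sequence line per bill.
print_bill_headers() {
    local input_path=$1
    local max i
    local -a names stores

    # Digits on the last line of the sequence file.
    max=$(sed -n '$s/[^0-9]*//gp' "$input_path/ocr_seqno.txt")
    # Drop headers and blank lines, then keep only the last word of each line.
    mapfile -t names < <(sed -e 's/^\[.*\]//g' -e '/^\s*$/d' -e 's/.*\s//g' "$input_path/ocr_names.txt")
    # Drop headers and blank lines, keep whole lines (store names may contain spaces).
    mapfile -t stores < <(sed -e 's/^\[.*\]//g' -e '/^\s*$/d' "$input_path/ocr_shopname.txt")

    for ((i = 0; i < max; i++)); do
        printf '%s:%s:%d\n' "${stores[i]}" "${names[i]}" "$((i + 1))"
    done
}

print_bill_headers "$dir"
```

Because the loop bound comes from the last line of ocr_seqno.txt, this approach relies on that line carrying the highest sequence number.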

Key techniques used:

  • sed for text processing
  • mapfile to read lines into an array
  • Bash parameter expansion for default values
  • Bash arrays for storing multiple values
  • Bash for loops for iterating over data

Explanation of Tools and Options

sed (Stream Editor)

sed is a powerful text processing tool. In this script, we use several sed options and commands:

  • -n: Suppresses automatic printing of pattern space.
  • -e: Adds an editing command; repeating it chains several commands in one invocation.
  • s///: Substitution command.
  • /pattern/: Pattern matching.
  • d: Delete command.
  • p: Print command.
  • q: Quit command.
  • H: Append pattern space to hold space.
  • x: Exchange contents of pattern and hold space.
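
The hold-space commands are easier to follow on a reduced example. The sketch below is a simplified variant of the idea behind the items command, not the script's exact sed program: it collects the lines between "Item" markers and joins each group with "; ". The function name join_groups and the sample data are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads stdin, prints one joined line per "Item" group.
join_groups() {
    sed -n '
# Append every non-marker line to the hold space.
/^Item$/!H
# A marker line or the last input line ends the current group.
/^Item$/b flush
$b flush
# Otherwise start the next cycle without printing.
d
:flush
# Swap the collected group into the pattern space.
x
# Remove the leading newline left by the first H, then join with "; ".
s/^\n//
s/\n/; /g
# Print only non-empty groups.
/./p
# Clear the pattern space and swap it back so the hold space starts empty.
s/.*//
x
'
}

printf 'Item\nMilk\nDairy\nItem\nSoap\nToiletries\n' | join_groups
```

H builds up the group one line at a time, and x brings it into the pattern space where the substitutions can reshape it. Flushing on the last line as well as on each marker ensures the final group is printed even when the input does not end with "Item".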

mapfile

mapfile (also known as readarray) is used to read lines from standard input into an array. We use it with process substitution (< <(...)) to populate arrays directly from command output.
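
The difference between mapfile and a plain array assignment from command substitution shows up as soon as a line contains spaces, as store names often do. The following sketch uses invented store names and a hypothetical function name, lines_to_array_demo, to compare the two.

```shell
#!/usr/bin/env bash
# Hypothetical demo comparing two ways of loading lines into an array.
lines_to_array_demo() {
    local -a split_words whole_lines

    # Unquoted command substitution splits on every space, tab, and newline.
    split_words=($(printf 'Sun General\nSV Stores\n'))

    # mapfile -t keeps each line as one element and strips the trailing newline.
    mapfile -t whole_lines < <(printf 'Sun General\nSV Stores\n')

    printf '%d %d\n' "${#split_words[@]}" "${#whole_lines[@]}"
    printf '%s\n' "${whole_lines[0]}"
}

lines_to_array_demo   # prints "4 2" and then "Sun General"
```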

Bash Parameter Expansion

The last line of the full script (not reproduced above) uses ${1:-.} and ${2:-.}. This is parameter expansion with a default value: if the parameter is unset or empty, the default (here ., the current directory) is used instead.
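
A small sketch of the ${N:-default} form follows. The function name resolve_paths and the meaning given to the second argument (an output directory) are assumptions for illustration; the article does not say what the second parameter is used for.

```shell
#!/usr/bin/env bash
# Hypothetical helper showing how ${1:-.} and ${2:-.} fall back to ".".
resolve_paths() {
    local input_path=${1:-.}    # first argument, or "." if unset or empty
    local output_path=${2:-.}   # second argument (assumed meaning), or "."
    printf 'input=%s output=%s\n' "$input_path" "$output_path"
}

resolve_paths             # prints input=. output=.
resolve_paths data        # prints input=data output=.
resolve_paths "" out      # prints input=. output=out
```

The colon matters: ${1:-.} substitutes the default for an empty string as well as for a missing argument, while ${1-.} would substitute it only when the argument is missing.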

Conclusion

This script demonstrates how to use Bash for text processing and data manipulation tasks. By combining tools like sed with Bash's built-in features, we can create robust solutions for complex data processing problems.

The key takeaways from this tutorial are:

  1. Use sed for complex text processing tasks.
  2. Utilize Bash arrays and loops for handling multiple data items.
  3. Employ functions to modularize your code and improve readability.
  4. Use Bash parameter expansion for flexible script arguments.

Remember, when dealing with OCR data, always account for potential errors and include data cleaning steps in your scripts.
