Introduction
In this tutorial, we'll explore a Bash script that reconstructs a Shopping Bill dataset from OCR (Optical Character Recognition) data files. This is a common problem in data processing where we need to merge and clean data from multiple sources to create a coherent dataset. We'll dive into the problem, the solution, and explain the various Bash tools and techniques used.
Problem Statement
We have a set of OCR data files containing information about shopping bills. These files include details such as shop names, customer names, items purchased, categories, quantities, prices, and costs. Our task is to reconstruct this data into individual shopping bill files, each containing the complete information for a single transaction.
The main challenges are:
- Extracting relevant information from multiple files
- Dealing with potential OCR errors
- Formatting the data correctly
- Generating individual files for each shopping bill
Inputs are available at IITM Google Drive
The Solution
Let's break down the solution into its main components and explain the tools and techniques used.
1. The extract_data Function
extract_data() {
    local seq=$1
    local file=$2
    # Regex fragments for this record's marker and the next record's marker
    local ps=".*${seq}"
    local pe=".*$((seq + 1))"
    sed -n "
        /^\[File: ${ps}\.jpg\]/{
            n;n;
            :a;
            /^\[File: ${pe}\.jpg\]/q;
            p;n;
            ba;
        }
    " "$file"
}
This function uses sed to extract the data for a specific record from a file. Here's how it works:
- It takes two parameters: the sequence number of the record and the file to search.
- It uses sed with the -n option, which suppresses automatic printing of the pattern space.
- The sed script searches for the [File: ...] marker line that starts the desired section.
- After the match, n;n; skips the marker line and the line that follows it.
- It then enters a loop, printing lines until it finds the start of the next section.
Key sed commands used:
- /pattern/: Searches for a pattern
- n: Moves to the next line
- :label: Defines a label
- p: Prints the current line
- q: Quits the script
- ba: Branches (jumps) to a label
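To see the loop in action, here is a minimal sketch that runs the same sequence of sed commands against a made-up file. The sample contents, the bill_N.jpg naming, and the function name extract_section are assumptions for illustration only; the real OCR files may be laid out differently.

```shell
#!/usr/bin/env bash
# Hypothetical sample that mimics the "[File: ...]" marker layout.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
cat > "$sample" <<'EOF'
[File: bill_1.jpg]
line skipped by the second n
Apple
Bread
[File: bill_2.jpg]
line skipped by the second n
Milk
EOF

# Same command sequence as extract_data, one sed command per line.
extract_section() {
    local seq=$1 file=$2
    local next=$((seq + 1))
    sed -n "
        /^\[File: .*_${seq}\.jpg\]/{
            n
            n
            :a
            /^\[File: .*_${next}\.jpg\]/q
            p
            n
            ba
        }
    " "$file"
}

extract_section 1 "$sample"
```

Calling it with 1 prints Apple and Bread, then quits as soon as the marker for record 2 appears. For the last record there is no following marker, so the loop simply ends when n runs out of input.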
2. The clean_empty_lines Function
clean_empty_lines() {
    sed '/^\s*$/d' "$1"
}
This simple function uses sed to remove empty lines from a file:
- /^\s*$/d: Matches lines that contain only whitespace and deletes them.
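As a quick check of the pattern, the sketch below feeds a few lines through a variant of the function. It is not the original: it uses [[:space:]], the POSIX spelling of GNU sed's \s, and passes "$@" so that it also reads standard input when no file is given. Both changes are assumptions made for the demonstration.

```shell
#!/usr/bin/env bash
# Variant of clean_empty_lines that works on files or on standard input.
clean_empty_lines() {
    sed '/^[[:space:]]*$/d' "$@"
}

# Two of the four input lines are empty or contain only spaces.
printf 'Apple\n\n   \nBread\n' | clean_empty_lines
```

Only the Apple and Bread lines survive; the empty line and the spaces-only line are deleted.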
3. The Main script Function
script() {
    local input_path=$1
    local max=$(sed -n '$s/[^0-9]*//gp' "$input_path/ocr_seqno.txt")
    local names=($(sed -e 's/^\[.*\]//g' -e '/^\s*$/d' -e 's/.*\s//g' "$input_path/ocr_names.txt"))
    mapfile -t stores < <(sed -e 's/^\[.*\]//g' -e '/^\s*$/d' "$input_path/ocr_shopname.txt")
    mapfile -t items < <(sed -e '/^\[File:/d' -e '/^Item$/!{H;d}' -e '/^Item$/{x;s/^\n//;s/\n/; /g;s/;[[:space:]]*;/;/g;s/^[[:space:]]*//;s/^;//;s/[[:space:]]*$//;s/^[[:space:]]*//;/./p;s/.*//;h}' -e '$ {x;s/^\n//;s/\n/; /g;s/;[[:space:]]*;/;/g;s/^[[:space:]]*//;s/^;//;s/[[:space:]]*$//;s/^[[:space:]]*//;/^[[:space:]]*$/d;/./p}' -e '/^[[:space:]]*$/d' "$input_path/ocr_items.txt")
    local categories=()

    # Data cleanup: overwrite two names that OCR misread
    names[9]=Rajesh
    names[22]=Julia

    # Print the formatted data
    for ((i = 0; i < max; i++)); do
        printf "%s:%s:%d\n" "${stores[$i]}" "${names[$i]}" $((i + 1))
    done
    echo "${#items[@]}"
    for ((i = 0; i < ${#items[@]}; i++)); do
        printf "%02d:%s\n" "$i" "${items[$i]}"
    done
}
This function is the heart of our script. Let's break it down:
- It uses sed to extract the maximum sequence number from ocr_seqno.txt.
- It processes ocr_names.txt and ocr_shopname.txt to extract names and store names.
- It uses a complex sed command to process ocr_items.txt and extract items.
- It performs some data cleanup to correct OCR errors.
- Finally, it formats and prints the data.
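The name-extraction step can be tried in isolation. The sketch below applies the same three sed expressions (blank out marker lines, delete empty lines, keep only the last word) to invented input and then patches one entry by index, as the cleanup step does. The sample text, the misread value Ju1ia, and the function name extract_names are assumptions; [[:space:]] stands in for GNU sed's \s.

```shell
#!/usr/bin/env bash
# Hypothetical OCR output; the second name contains a misread character.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
cat > "$sample" <<'EOF'
[File: bill_1.jpg]
Name: Rajesh

[File: bill_2.jpg]
Name: Ju1ia
EOF

extract_names() {
    # 1) blank out marker lines  2) drop empty lines  3) keep the last word
    sed -e 's/^\[.*\]//' \
        -e '/^[[:space:]]*$/d' \
        -e 's/.*[[:space:]]//' "$@"
}

# Word splitting is acceptable here because each result is a single word.
names=($(extract_names "$sample"))

# Manual correction by index, in the same style as the cleanup step.
names[1]=Julia

printf '%s\n' "${names[@]}"
```

Fixing entries by index is simple but brittle: the indices only stay valid as long as the input files keep the same order.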
Key techniques used:
- sed for text processing
- mapfile to read lines into an array
- Bash parameter expansion for default values
- Bash arrays for storing multiple values
- Bash for loops for iterating over data
Explanation of Tools and Options
sed (Stream Editor)
sed is a powerful text processing tool. In this script, we use several sed options and commands:
- -n: Suppresses automatic printing of pattern space.
- -e: Allows multiple editing commands.
- s///: Substitution command.
- /pattern/: Pattern matching.
- d: Delete command.
- p: Print command.
- q: Quit command.
- H: Append pattern space to hold space.
- x: Exchange contents of pattern and hold space.
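The hold-space commands are the least intuitive of these, so here is a stripped-down sketch of the idea behind the items command: collect lines with H until an Item line arrives, then swap them in with x and join them with "; ". It is a simplified illustration rather than the script's full command, and the function name join_groups and the sample input are assumptions.

```shell
#!/usr/bin/env bash
# Join each run of lines that precedes an "Item" line into one record.
join_groups() {
    sed -n '
        # Any line other than the Item delimiter: append it to the hold space.
        /^Item$/!{
            H
            d
        }
        # Only Item lines reach this point: swap the collected lines in.
        x
        # Drop the leading newline that the first H added, then join.
        s/^\n//
        s/\n/; /g
        /./p
        # Empty the pattern space and copy it over the hold space.
        s/.*//
        h
    ' "$@"
}

printf 'Apple\nBread\nItem\nMilk\nItem\n' | join_groups
```

This prints "Apple; Bread" and then "Milk". In this sketch, any lines after the final Item line stay in the hold space and are never printed.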
mapfile
mapfile (also known as readarray) is used to read lines from standard input into an array. We use it with process substitution (< <(...)) to populate arrays directly from command output.
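A small comparison suggests why mapfile suits values that may contain spaces, such as store names. The shop names and the helper list_stores below are invented for the example; mapfile itself requires Bash 4 or later.

```shell
#!/usr/bin/env bash
# Hypothetical shop names; note that each one contains a space.
list_stores() {
    printf '%s\n' 'Big Bazaar' 'Sun General'
}

# Command substitution inside ( ) splits on every whitespace character.
words=($(list_stores))

# mapfile -t stores one whole line per element and strips the newline.
mapfile -t stores < <(list_stores)

echo "word-split elements: ${#words[@]}"
echo "mapfile elements:    ${#stores[@]}"
echo "first store:         ${stores[0]}"
```

Word splitting yields four elements, while mapfile keeps the two lines intact.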
Bash Parameter Expansion
In the last line of the script, we use ${1:-.} and ${2:-.}. This is parameter expansion with a default value. If the parameter is unset or null, the default value (in this case, . for the current directory) is used.
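The effect of the default is easy to check with a throwaway function. The name show_paths and the variable output_path are assumptions for this sketch; the article does not show how the script uses its second argument.

```shell
#!/usr/bin/env bash
# Each positional parameter falls back to "." when unset or empty.
show_paths() {
    local input_path=${1:-.}
    local output_path=${2:-.}   # hypothetical name for the second argument
    printf 'input=%s output=%s\n' "$input_path" "$output_path"
}

show_paths            # no arguments: both default to "."
show_paths data       # only the first argument is given
show_paths "" out     # an empty first argument also triggers the default
```

Because the expansion uses :- rather than -, an empty string is treated the same way as a missing argument.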
Conclusion
This script demonstrates how to use Bash for text processing and data manipulation tasks. By combining tools like sed with Bash's built-in features, we can create robust solutions for complex data processing problems.
The key takeaways from this tutorial are:
- Use sed for complex text processing tasks.
- Utilize Bash arrays and loops for handling multiple data items.
- Employ functions to modularize your code and improve readability.
- Use Bash parameter expansion for flexible script arguments.
Remember, when dealing with OCR data, always account for potential errors and include data cleaning steps in your scripts.