NoisOCR is a Python library for simulating the noise found in texts produced by Optical Character Recognition (OCR). Such texts often contain recognition errors or annotations, reflecting the difficulty of running OCR on low-quality documents or manuscripts. The library simulates common post-OCR errors and partitions texts into sliding windows, with or without hyphenation, which can help when preparing training data for neural network models for spelling correction.
GitHub Repository: NoisOCR
PyPI: NoisOCR on PyPI
Features
- Sliding windows: Split long texts into smaller segments without breaking words.
- Sliding windows with hyphenation: Use hyphenation to fit words within character limits.
- Simulate text errors: Add random errors to simulate post-OCR low-accuracy texts.
- Simulate text annotations: Insert annotations like those found in the BRESSAY dataset to mark words or phrases in the text.
Installation
You can easily install NoisOCR via pip:
pip install noisocr
Usage Examples
1. Sliding Window
This function divides a text into segments of limited size, keeping the words intact.
import noisocr
text = "Lorem Ipsum is simply dummy...type specimen book."
max_window_size = 50
windows = noisocr.sliding_window(text, max_window_size)
# Output:
# [
# 'Lorem Ipsum is simply dummy text of the printing',
# ...
# 'type and scrambled it to make a type specimen',
# 'book.'
# ]
2. Sliding Window with Hyphenation
When using hyphenation, the function attempts to fit words that exceed the character limit per window by inserting hyphens as necessary. This functionality supports multiple languages through the PyHyphen package.
import noisocr
text = "Lorem Ipsum is simply dummy...type specimen book."
max_window_size = 50
windows = noisocr.sliding_window_with_hyphenation(text, max_window_size, 'en_US')
# Output:
# [
# 'Lorem Ipsum is simply dummy text of the printing ',
# 'typesetting industry. Lorem Ipsum has been the in-',
# ...
# 'scrambled it to make a type specimen book.'
# ]
3. Simulating Text Errors
The simulate_errors function allows users to add random errors to the text, emulating issues commonly found in post-OCR texts. The typo library generates the errors, such as character swaps, missing spaces, extra characters, and more.
import noisocr
text = "Hello, world!"
text_with_errors = noisocr.simulate_errors(text, interactions=1)
# Output: Hello, wotrld!
text_with_errors = noisocr.simulate_errors(text, 2)
# Output: Hsllo,wlorld!
text_with_errors = noisocr.simulate_errors(text, 5)
# Output: fllo,w0rlr!
4. Simulating Text Annotations
The annotation simulation feature allows the user to add custom markings to the text based on a set of annotations, including those from the BRESSAY dataset.
import noisocr
text = "Hello, world!"
text_with_annotation = noisocr.simulate_annotation(text, probability=0.5)
# Output: Hello, $$--xxx--$$
text_with_annotation = noisocr.simulate_annotation(text, probability=0.5)
# Output: Hello, ##--world!--##
text_with_annotation = noisocr.simulate_annotation(text, 0.01)
# Output: Hello, world!
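Since the stated goal is training spelling-correction models, the windowing and noise steps can be combined into (noisy, clean) pairs. The following is a minimal sketch of that idea, not part of NoisOCR: `build_training_pairs`, `split_in_halves`, and `zero_for_o` are hypothetical names, and the last two are simple stand-ins so the example runs without the library installed.

```python
def build_training_pairs(text, window_fn, noise_fn):
    """Return one (noisy, clean) pair per window produced by window_fn."""
    return [(noise_fn(window), window) for window in window_fn(text)]


# Hypothetical stand-in for a windowing function: split the words in two halves.
def split_in_halves(text):
    words = text.split()
    mid = len(words) // 2
    return [" ".join(words[:mid]), " ".join(words[mid:])]


# Hypothetical stand-in for a noise function: confuse the letter o with zero.
def zero_for_o(window):
    return window.replace("o", "0")


pairs = build_training_pairs(
    "Hello world from the scanner", split_in_halves, zero_for_o
)
for noisy, clean in pairs:
    print(repr(noisy), "<-", repr(clean))
```

With NoisOCR installed, the stand-ins could presumably be replaced by small wrappers around the calls shown in the usage examples, such as `lambda t: noisocr.sliding_window(t, 50)` and `lambda w: noisocr.simulate_errors(w, interactions=2)`.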
Code Overview
The core functions of the NoisOCR library build on existing libraries: typo for simulating errors and hyphen for managing word hyphenation across different languages. Below is an explanation of the critical functions.
1. simulate_annotation Function
The simulate_annotation function selects a random word from the text and annotates it, following a defined set of annotations.
import random

annotations = [
    '##@@???@@##', '$$@@???@@$$', '@@???@@', '##--xxx--##',
    '$$--xxx--$$', '--xxx--', '##--text--##', '$$--text--$$',
    '##text##', '$$text$$', '--text--'
]

def simulate_annotation(text, annotations=annotations, probability=0.01):
    words = text.split()
    # Texts with a single word (or none) are returned unchanged.
    if len(words) > 1:
        target_word = random.choice(words)
    else:
        return text
    if random.random() < probability:
        annotation = random.choice(annotations)
        # Templates containing 'text' wrap the word; the others replace it.
        if 'text' in annotation:
            annotated_text = annotation.replace('text', target_word)
        else:
            annotated_text = annotation
        # Only the first occurrence of the chosen word is replaced.
        result_text = text.replace(target_word, annotated_text, 1)
        return result_text
    else:
        return text
2. simulate_errors Function
The simulate_errors function applies various errors to the text, randomly selected from the typo library.
import random
import typo

def simulate_errors(text, interactions=3, seed=None):
    methods = ["char_swap", "missing_char", "extra_char", "nearby_char", "similar_char", "skipped_space", "random_space", "repeated_char", "unichar"]
    if seed is not None:
        random.seed(seed)
    else:
        random.seed()
    instance = typo.StrErrer(text)
    # Pick one error method by name and apply it once.
    method = random.choice(methods)
    method_to_call = getattr(instance, method)
    text = method_to_call().result
    # Recurse until the counter reaches zero. One error is applied before
    # this check, so the text receives interactions + 1 errors in total.
    if interactions > 0:
        interactions -= 1
        text = simulate_errors(text, interactions, seed=seed)
    return text
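Two details of the listing above are worth noting: it reseeds the global random module on every recursive call, so with a fixed seed the same method name is drawn in each pass, and it applies one more error than the interactions argument suggests. The sketch below shows an iterative alternative with a local random generator. It is not NoisOCR's implementation and does not use typo; `inject_errors` and the three small error operations are hypothetical names introduced only for illustration.

```python
import random


def swap_chars(s, rng):
    """Swap two adjacent characters at a random position."""
    if len(s) < 2:
        return s
    i = rng.randrange(len(s) - 1)
    return s[:i] + s[i + 1] + s[i] + s[i + 2:]


def drop_char(s, rng):
    """Delete one character at a random position."""
    if not s:
        return s
    i = rng.randrange(len(s))
    return s[:i] + s[i + 1:]


def duplicate_char(s, rng):
    """Repeat one character at a random position."""
    if not s:
        return s
    i = rng.randrange(len(s))
    return s[:i] + s[i] + s[i:]


# Hypothetical stand-ins for the error methods provided by typo.
ERROR_OPS = [swap_chars, drop_char, duplicate_char]


def inject_errors(text, n_errors=3, seed=None):
    """Apply exactly n_errors random edits, reproducibly for a given seed."""
    rng = random.Random(seed)  # local generator; global random state untouched
    for _ in range(n_errors):
        op = rng.choice(ERROR_OPS)
        text = op(text, rng)
    return text


noisy = inject_errors("Hello, world!", n_errors=3, seed=42)
print(noisy)
```

Because the generator is created once, a fixed seed still yields a varied sequence of operations, and calling the function twice with the same seed returns the same string.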
3. sliding_window and sliding_window_with_hyphenation Functions
These functions are responsible for splitting the text into sliding windows, with or without hyphenation. The hyphenating variant is listed here.
from hyphen import Hyphenator

def sliding_window_with_hyphenation(text, window_size=80, language='pt_BR'):
    hyphenator = Hyphenator(language)
    words = text.split()
    windows = []
    current_window = []
    remaining_word = ""
    for word in words:
        # Prepend whatever was carried over from the previous window.
        if remaining_word:
            word = remaining_word + word
            remaining_word = ""
        if len(" ".join(current_window)) + len(word) + 1 <= window_size:
            current_window.append(word)
        else:
            # The word does not fit: add as many syllables as the window allows.
            syllables = hyphenator.syllables(word)
            temp_word = ""
            for i, syllable in enumerate(syllables):
                if len(" ".join(current_window)) + len(temp_word) + len(syllable) + 1 <= window_size:
                    temp_word += syllable
                else:
                    if temp_word:
                        # Break the word here and carry the rest over.
                        current_window.append(temp_word + "-")
                        remaining_word = "".join(syllables[i:]) + " "
                        break
                    else:
                        # No syllable fits: carry the whole word over.
                        remaining_word = word + " "
                        break
            else:
                current_window.append(temp_word)
                remaining_word = ""
            windows.append(" ".join(current_window))
            current_window = []
    if remaining_word:
        current_window.append(remaining_word)
    if current_window:
        windows.append(" ".join(current_window))
    return windows
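The variant without hyphenation is not listed here. As a rough illustration of the idea described in the usage section (fill each window with whole words up to the character limit), a minimal sketch could look like the following. The name `sliding_window_sketch` is hypothetical and the greedy packing strategy is an assumption, so the library's own sliding_window may behave differently.

```python
def sliding_window_sketch(text, window_size=80):
    """Greedily pack whole words into windows of at most window_size characters."""
    windows = []
    current = ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if len(candidate) <= window_size or not current:
            # Either the word fits, or it is a single over-long word that
            # gets a window of its own instead of being split.
            current = candidate
        else:
            # The word does not fit: close this window and start a new one.
            windows.append(current)
            current = word
    if current:
        windows.append(current)
    return windows


sample = "Lorem Ipsum is simply dummy text of the printing and typesetting industry."
for window in sliding_window_sketch(sample, 50):
    print(repr(window))
```

Words are never split in this sketch, so a word longer than the limit produces a window that exceeds it; the hyphenating function above handles that case by breaking the word at a syllable boundary.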
Conclusion
NoisOCR provides essential tools for those working on post-OCR text correction, making it easier to simulate real-world scenarios where digitized texts are prone to errors and annotations. Whether for automated testing, text correction model development, or analysis of datasets like BRESSAY, this library is a versatile and user-friendly solution.
Check out the project on GitHub: NoisOCR and contribute to its improvement!