Mohammed Ibrahim

Posted on Oct 23, 2023

How to use Tesseract OCR for .NET on Windows

#tesseract #ocr #csharp #window

1.0 Introduction

OCR (Optical Character Recognition) technology has changed the way scanned documents are used in today's digital environment. It allows computers to recognize and extract text from a variety of sources, including scanned PDF documents, so we can easily edit and interact with them. With OCR software such as Adobe Acrobat, it is quick and easy to extract text from scanned documents and turn them into editable PDFs or searchable image-based PDFs.

2.0 Tesseract OCR

2.1 Introduction

Tesseract was first developed between 1985 and 1994 at Hewlett-Packard Laboratories in Bristol, UK, and Hewlett-Packard Co. in Greeley, Colorado, USA. In 1996 it was ported to Windows, and in 1998 parts of it were converted to C++. HP made Tesseract open source in 2005, and Google worked on its development from 2006 until November 2018. Major version 5, the current stable line, began with release 5.0.0 on November 30, 2021. Bugfix releases and newer minor versions are available on GitHub.

2.2 Features of Tesseract

Tesseract 4 introduced a new neural network (LSTM) based OCR engine that is focused on line recognition, while still supporting the legacy Tesseract 3 engine, which works by recognizing character patterns. For compatibility with Tesseract 3, select the Legacy OCR Engine mode (--oem 0) and install trained data files that support the legacy engine. Ray Smith was the lead developer until 2018; the current lead developer is Stefan Weil, and the maintainer is Zdenko Podobny.
Tesseract supports Unicode (UTF-8) and can recognize more than 100 languages "out of the box".
Tesseract accepts PNG, JPEG, and TIFF image formats.
Tesseract supports several output formats, including plain text, hOCR (HTML), PDF, invisible-text-only PDF, TSV, and ALTO.
Note that in many cases you will need to improve the quality of the input image to obtain good OCR results.
Tesseract can be trained to recognize other languages.
Tesseract uses the Leptonica library to open input images. A Leptonica build with zlib, PNG, and TIFF support is recommended.
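
The engine mode, language, and output format listed above are all chosen on the Tesseract command line. The following is a minimal C# sketch of how such a command line could be assembled. The helper names BuildTesseractArgs and Quote and the file name scan.png are made up for illustration; the flags -l and --oem and the trailing format name (txt, hocr, pdf, tsv, alto) follow the standard tesseract command-line usage.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: builds the argument string for
//   tesseract <image> <outputBase> -l <lang> [--oem <n>] [format]
// Engine mode 0 selects the legacy engine and 1 the LSTM engine.
static string BuildTesseractArgs(string imagePath, string outputBase,
    string language = "eng", int? engineMode = null, string outputFormat = "")
{
    var args = new List<string> { Quote(imagePath), Quote(outputBase), "-l", language };
    if (engineMode.HasValue)
    {
        args.Add("--oem");
        args.Add(engineMode.Value.ToString());
    }
    // The output format is selected by a trailing config name such as pdf or hocr.
    if (!string.IsNullOrEmpty(outputFormat))
    {
        args.Add(outputFormat);
    }
    return string.Join(" ", args);
}

// Wraps a value in double quotes only when it contains a space.
static string Quote(string value) => value.Contains(" ") ? "\"" + value + "\"" : value;

Console.WriteLine(BuildTesseractArgs("scan.png", "scan", "eng", 1, "pdf"));
// prints: scan.png scan -l eng --oem 1 pdf
```

Passing the resulting string to the tesseract executable would, under these assumptions, produce scan.pdf next to the input image.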

2.3 Install Tesseract OCR on Windows

Step 1: Download Tesseract Setup
To install Tesseract OCR for Windows, download the .exe installer that matches your computer's operating system. The installers are available at https://github.com/UB-Mannheim/tesseract/wiki.
Step 2: Tesseract Installation
Step 2.1: Double-click the .exe file to start the installation process.

Step 2.2: Once setup starts, it asks you to choose a language. This is only the language used in the installer's help and dialog windows. If we want to use Tesseract OCR for Windows in more than one language, we will have the chance to install additional languages later.

Step 2.3: Before proceeding, the setup screen advises closing any other applications. You do not have to close them, but if you don't, you might need to restart your computer for the installation to finish.

Step 2.4: Tesseract OCR is open source and licensed under the Apache License 2.0, so you are free to redistribute Tesseract and modified versions of it without paying royalties. To proceed with the installation, you must accept the license agreement.

Step 2.5: Next, select which users on this computer Tesseract should be installed for.

Step 2.6: Next, select the installation location. Copy the install location to a .txt file before moving on to the next step, because after the installation is finished we must add it to our machine's environment variables.

Step 2.7: By default, the following components are selected: language data, shortcut creation, Training Tools, and ScrollView. Leave all of these selected unless you have a clear reason not to install them.

Step 2.8: In the final step, we are prompted to select the Start menu folder for the Tesseract OCR for Windows shortcuts. Mine is left at the default name, "Tesseract-OCR". Tesseract OCR for Windows starts installing after we click the Install button. Next, the installation path needs to be added to our machine's environment variables.
Step 3: Environment variables Setup
Step 3.1: To add the installation location to our environment variables, open the Start menu and type "environment variables". A result named "Edit the system environment variables" should appear. If it does not, you can get there through Control Panel > System > Advanced system settings.

Step 3.2: When the "System Properties" dialog box appears, select the "Advanced" tab, and then click the "Environment Variables..." button in the lower right corner of the dialog.

Step 3.3: Under "System variables", select the "Path" variable and click the "Edit..." button.

Step 3.4: When the "Edit environment variable" window appears, select "New" and enter the Tesseract OCR installation path that we copied in Step 2.6. Then press the "OK" button.

Step 3.5: And that's it! Having run the .exe installer and added the Tesseract OCR for Windows install path to our environment variables, we can verify that the installation succeeded by running Tesseract on a test image.
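
For readers who prefer to script Step 3, here is a rough C# sketch of the same idea. It is an illustration rather than part of the installer's workflow: AppendToPath is a made-up helper, the install directory shown is only an assumed location (use the path you noted in Step 2.6), and the sketch targets the current user's Path rather than the system-wide variable so that it does not need administrator rights.

```csharp
using System;
using System.Linq;

// Hypothetical helper: returns the PATH value with `directory` appended,
// unless that directory is already listed (ignoring case and a trailing backslash).
static string AppendToPath(string currentPath, string directory)
{
    var entries = (currentPath ?? "")
        .Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries)
        .ToList();
    bool alreadyPresent = entries.Any(e =>
        string.Equals(e.TrimEnd('\\'), directory.TrimEnd('\\'), StringComparison.OrdinalIgnoreCase));
    if (!alreadyPresent)
    {
        entries.Add(directory);
    }
    return string.Join(";", entries);
}

// Assumed install location, for illustration only.
string installDir = @"C:\Program Files\Tesseract-OCR";
string userPath = Environment.GetEnvironmentVariable("Path", EnvironmentVariableTarget.User) ?? "";
string updated = AppendToPath(userPath, installDir);
Console.WriteLine(updated);

// Uncomment to persist the change for the current user (Windows only):
// Environment.SetEnvironmentVariable("Path", updated, EnvironmentVariableTarget.User);
```

The write is left commented out on purpose, so the sketch only prints what the new value would be until you decide to apply it.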
Step 4: Run a Tesseract OCR Test
Open a new command prompt and run the tesseract command to verify that Tesseract OCR for Windows was installed successfully. A brief overview of Tesseract's usage options should appear.
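
The same check can be made from .NET code before an application tries to call Tesseract. The sketch below is one possible approach, not an official API: FindOnPath is a made-up helper that looks for tesseract.exe in each directory of the PATH variable.

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical helper: searches each directory in a PATH-style string
// for the given file name and returns the first full path that exists, or null.
static string FindOnPath(string fileName, string pathValue)
{
    return (pathValue ?? "")
        .Split(new[] { Path.PathSeparator }, StringSplitOptions.RemoveEmptyEntries)
        .Select(dir => Path.Combine(dir.Trim(), fileName))
        .FirstOrDefault(File.Exists);
}

string found = FindOnPath("tesseract.exe", Environment.GetEnvironmentVariable("PATH"));
Console.WriteLine(found ?? "tesseract.exe was not found on PATH; recheck Step 3.");
```

If the message about rechecking Step 3 appears, the most likely causes are a mistyped install path or a command prompt that was opened before the environment variable was changed.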

3.0 Extract text from an image using Tesseract OCR

Now that Tesseract is installed, let's test its OCR accuracy on a sample document. Place the image in the directory where your command prompt is open, then run Tesseract on it with a command of this form:

`tesseract test.png test`

This launches Tesseract, passes it the sample image test.png, and tells it to save the extracted text under the output base name test, which produces a new file named test.txt.
If we look inside the current directory, we now see a new text file containing the contents of our image. Comparing the text in the sample image with the freshly created .txt file shows that Tesseract's extraction was remarkably accurate, with only a few noticeable mistakes.
The sample image used for this Tesseract run is shown below.
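
The same command-line run can be driven from a .NET program by starting the tesseract process. This is a minimal sketch under a few assumptions: tesseract is reachable through PATH as set up in Step 3, the helper names BuildOcrCommand and RunOcr are made up for illustration, and test.png sits in the working directory (the example skips the run if it does not).

```csharp
using System;
using System.Diagnostics;
using System.IO;

// Hypothetical helper: describes the command "tesseract <image> <outputBase>".
// Tesseract appends ".txt" to the output base name itself.
static ProcessStartInfo BuildOcrCommand(string imagePath, string outputBase)
{
    return new ProcessStartInfo
    {
        FileName = "tesseract",          // resolved through PATH
        Arguments = $"\"{imagePath}\" \"{outputBase}\"",
        UseShellExecute = false,
        RedirectStandardError = true,    // capture Tesseract's diagnostic messages
        CreateNoWindow = true
    };
}

// Hypothetical helper: runs Tesseract and returns the text from the generated .txt file.
static string RunOcr(string imagePath, string outputBase)
{
    using (Process process = Process.Start(BuildOcrCommand(imagePath, outputBase)))
    {
        string errors = process.StandardError.ReadToEnd();
        process.WaitForExit();
        if (process.ExitCode != 0)
        {
            throw new InvalidOperationException("Tesseract failed: " + errors);
        }
    }
    return File.ReadAllText(outputBase + ".txt");
}

// Only run when the sample image is present in the working directory.
if (File.Exists("test.png"))
{
    Console.WriteLine(RunOcr("test.png", "test"));
}
```

Shelling out like this keeps the application independent of any wrapper library, at the cost of one process start and one temporary file per image.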

4.0 IronOCR

IronOCR is a native C# OCR library that builds on Tesseract and aims for better accuracy, stability, and performance than the conventional Tesseract library. It makes text extraction from PDFs and images possible in .NET tools and websites. IronOCR can return structured data or plain text in a large number of languages, and it can read barcodes as well as images with embedded text. The Iron Software OCR library can be used in .NET console, web, MVC, and desktop applications. For commercial deployments, the development team provides hands-on support. IronOCR is compatible with the latest versions of Visual Studio.

4.1 Advantages of IronOCR

IronOCR can read barcodes, QR codes, and scanned paper documents from images or PDF files using the current Tesseract 5 engine. The package makes it easier to integrate OCR into desktop, console, and web applications.
With IronOCR, we can run OCR on scanned PDFs and turn them into searchable PDFs.
IronOCR supports 127 languages, in addition to custom languages and word lists.
The library can read more than 20 distinct kinds of barcodes and QR codes.
IronOCR can produce plain text output in addition to barcode data. Alternatively, developers can access all content through a structured data object model for direct input into other systems. This covers headings, paragraphs, lines, words, and characters, arranged logically.

4.2 OCR Processing Using IronOCR

IronOCR is a powerful OCR library that gives access to the data in PDF documents and images and converts it into machine-readable text for analysis and processing, without jeopardizing data privacy. Here is an example of using IronOCR to extract text from an image:
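```csharp
using System;
using IronOcr;

// Configure the engine: best-accuracy English model on Tesseract 5.
var Ocr = new IronTesseract();
Ocr.Language = OcrLanguage.EnglishBest;
Ocr.Configuration.TesseractVersion = TesseractVersion.Tesseract5;

// OcrInput is disposable, so it is wrapped in a using block.
using (OcrInput ocrInput = new OcrInput("Demo.gif"))
{
    // Run OCR on the input and print the recognized text.
    OcrResult ocrResult = Ocr.Read(ocrInput);
    Console.WriteLine(ocrResult.Text);
}
```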
The code snippet first creates and configures an IronTesseract object. It then creates an OcrInput object, which can hold one or more image files. Here the image path is passed to the constructor, and further images can be added with the OcrInput object's Add function, so you can include as many images as you require. Finally, the Read function of the IronTesseract object parses the images and returns an OcrResult, whose Text property holds the extracted text as a single string.

The output shows the text taken from the supplied image, which confirms that the extraction worked correctly. IronOCR also offers a number of output formats for storing the results.

To learn more, see the Tesseract/IronOCR tutorial at https://ironsoftware.com/csharp/ocr/blog/ocr-tools/tesseract-ocr-windows/.

5.0 Conclusion

Tesseract is a great tool for C++ developers, but it is not a full OCR toolkit for .NET. A .NET wrapper is available, but it is not kept up to date with the latest Tesseract and has only limited documentation. For Tesseract to work accurately with scanned or photographed images, they must first be preprocessed so that they are straightened, standardized, high-resolution, and free of digital noise.
IronOCR, in contrast, supports a range of .NET project types. It improves on Tesseract's output and corrects poorly scanned words or images, and its NuGet package manages the intricate Tesseract dictionary system. Because it requires very little code, IronOCR is a strong choice for building an OCR tool, for example to automate invoice processing and data extraction.
With support for many image formats, PDF files, and multi-frame TIFF, IronOCR reduces the need for extra preparation and provides a smooth experience. It also goes beyond optical character recognition by offering barcode recognition for extracting data from images that contain barcodes. A development edition of IronOCR is available for free for thirty days.
