Mehr Muhammad Hamza

How to use Tesseract OCR in C#

Introduction:

This article explains what OCR is and how to extract text from images in C# using Tesseract and IronOCR. After reading it, you should be able to build a C# Windows Forms or ASP.NET application that takes an image as input and returns its text as output. You can then use that text for searching or any other purpose.

The Flow of Article:

  1. What is OCR?
  2. What is Tesseract OCR?
  3. What is IronOCR?
  4. Extract text from an image using IronOcr - Step by Step Guide
     a. Create Project
     b. Install NuGet Package for IronOcr
     c. Design Windows Form
     d. Write Code
     e. Run the Solution
  5. Extract text from images in different languages using IronOcr
  6. Extract text from an image using Tesseract - Step by Step Guide
     a. Install NuGet Package for Tesseract
     b. Write Code
     c. Run the Solution
  7. Fair Comparison between Tesseract and IronOcr
  8. Why IronOcr
  9. Summary

What is OCR (Optical Character Recognition)?

OCR stands for "Optical Character Recognition." It is a technology that recognizes text within a digital image. It is commonly used to recognize text in scanned documents and images.

OCR software can be used to convert a physical paper document or an image into an accessible electronic version with text. For example, if you scan a paper document or photograph with a scanner or multifunction printer, the device will most likely create a file containing a digital image. The file could be a JPG, TIFF, or PDF, but it is still only an image of the original document. You can then load this scanned file into an OCR program, which will recognize the text and convert the document into an editable text file.

What is Tesseract OCR?

Tesseract is an open-source optical character recognition (OCR) engine used to convert scanned paper documents, PDF files, and images into searchable text data. The engine detects the characters present in the image and assembles those characters into words, enabling developers to search and edit the content of the document.

What is IronOCR?

IronOCR is another optical character recognition technology. It is a .NET library that converts images into editable, readable text, which lets us read text from images in a C# application. The library supports more than 100 languages, so you can extract text from an image whether it is in English, Persian, or another supported language.

Let us show how we can use IronOCR in our Application.

How to use IronOCR to Extract text from the image - Step By Step Guide:

Step # 1: Open Visual Studio and Create Project:

Open Visual Studio. I am using Visual Studio 2019, but you can use any version.
Click Create a new project, then select Windows Forms App from the templates.
Press Next. Name the project, select a location, and press Next again.
Select the target framework. I have chosen .NET 5.0, but you can choose whichever you prefer. Then press Finish, and the Windows Forms application will be created.
Before proceeding further, we have to install the IronOCR NuGet package to use it in our program.

Step # 2: Install NuGet Package IronOcr:

Open the Package Manager Console from Tools > NuGet Package Manager > Package Manager Console.
Type Install-Package IronOcr in the Package Manager Console and press Enter.
NuGet will start installing IronOCR in your project. Once the installation finishes, open your Windows Form to design the application.

Step # 3: Design Windows Form:

Open the Toolbox and drag the following controls onto the form: 1 Label (for the title of our program), 2 Buttons (one for selecting the image and another for converting the image into text), 1 TextBox to display the image path, 1 PictureBox to display the image, and 1 RichTextBox to display the extracted text.

Design the form as you like.
Let's write the code behind the buttons to see how easy it is to extract the text from an image using IronOcr.

Step # 4: Write Code behind the Buttons.

Double-click the Select Image button.

Visual Studio will generate the following empty event handler.

private void SelectImage_Click(object sender, EventArgs e)
{

}

Write the following code inside this function.

private void SelectImage_Click(object sender, EventArgs e)
{
    // dispose the dialog once the user has made a choice
    using (OpenFileDialog open = new OpenFileDialog())
    {
        // image filters
        open.Filter = "Image Files(*.jpg; *.jpeg; *.gif; *.bmp)|*.jpg; *.jpeg; *.gif; *.bmp";
        if (open.ShowDialog() == DialogResult.OK)
        {
            // display image in picture box
            pictureBox1.Image = new Bitmap(open.FileName);
            // image file path
            ImagePath.Text = open.FileName;
        }
    }
}

Now double-click the Convert to Text button. The following empty event handler will be generated.

private void ConvertToText_Click(object sender, EventArgs e)
{

}

Add the following namespace at the top of the file.

using IronOcr;

Now, add the following code inside the ConvertToText_Click() function.

private void ConvertToText_Click(object sender, EventArgs e)
{
    // create the OCR engine, read the selected image, and show the recognized text
    IronTesseract ocr = new IronTesseract();
    var result = ocr.Read(ImagePath.Text);
    richTextBox1.Text = result.Text;
}

As you can see, we have written just three lines of code to perform such a big task, thanks to IronOcr.
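One thing the handler does not cover is the case where Convert to Text is clicked before an image has been selected, because the text box content is passed to Read as-is. Below is a minimal sketch of a guard that uses only System.IO. The class and method names (ImagePathGuard, IsUsableImagePath) are hypothetical helpers for illustration and are not part of IronOcr; the extension list simply mirrors the dialog filter used above.

```csharp
using System;
using System.IO;
using System.Linq;

public static class ImagePathGuard
{
    // Extensions offered by the Select Image dialog filter in this article.
    private static readonly string[] AllowedExtensions = { ".jpg", ".jpeg", ".gif", ".bmp" };

    // Returns true only when the path is non-empty, has an allowed
    // extension (case-insensitive), and points to an existing file.
    public static bool IsUsableImagePath(string path)
    {
        if (string.IsNullOrWhiteSpace(path))
        {
            return false;
        }

        string extension = Path.GetExtension(path);
        bool extensionAllowed =
            AllowedExtensions.Contains(extension, StringComparer.OrdinalIgnoreCase);
        return extensionAllowed && File.Exists(path);
    }
}
```

In ConvertToText_Click, one could call ImagePathGuard.IsUsableImagePath(ImagePath.Text) first and show a message instead of running OCR when it returns false.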

Step # 5: Run the Project:

Let's run the project.

Press Ctrl + F5 to run the project.
Click the Select Image button and choose an image. I am selecting a snapshot of an article, but you can select any image you need.
Now press the Convert to Text button to extract all the English text from the article image.
As you can see, I have easily extracted the text from an image of the article. In this test the result was accurate, and the application was easy to develop, thanks to IronOcr.

Use IronOcr to Extract Text from an Image in a Different Language:

IronOcr supports more than 100 languages. Let's test it with Chinese.

To extract text in any language other than English, you have to install the NuGet package for that particular language. Suppose we want to extract Chinese text.

Step # 1: Install Nuget Package for the Particular Language:

Run the following command in the NuGet Package Manager Console of Visual Studio.

Install-Package IronOcr.Languages.Chinese
Then set the language on the engine before reading the image:

ocr.Language = OcrLanguage.ChineseSimplified;

The complete handler becomes:

private void ConvertToText_Click(object sender, EventArgs e)
{
    IronTesseract ocr = new IronTesseract();
    // switch the engine from the default language to Simplified Chinese
    ocr.Language = OcrLanguage.ChineseSimplified;
    var result = ocr.Read(ImagePath.Text);
    richTextBox1.Text = result.Text;
}

Let’s test again.

Step # 2: Run the Project:

We have converted our Chinese-language image into text with just one additional line of code. The IronOcr .NET library gives us accuracy, efficiency, and performance, and it is easy to use in a .NET application.

How to Extract Text from the Image using traditional Tesseract: Step By Step Guide:

Let us now build the same example with Tesseract OCR. Keep everything else the same, including the Windows Form we designed. We will only change the code behind the ConvertToText_Click button.

Step # 1: Install Nuget Package for Tesseract:

Before using Tesseract in our project, we have to install the Tesseract NuGet package.

Run the following command in the NuGet Package Manager Console.

Install-Package Tesseract

After installing the NuGet package, you have to add the language files to the project folder manually, which is a drawback of this library. Download the language files from the following link, unzip the archive, and copy the tessdata folder into the Debug folder of your project (the build output folder), so that the relative path .\tessdata used in the code resolves at run time.
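Because the language files are copied by hand, a missing or misplaced file is an easy mistake to make. The sketch below checks for the expected files before the engine is created. It relies on Tesseract's convention of naming each language file "<code>.traineddata" (for example eng.traineddata); the class and method names (TessdataCheck, MissingLanguages) are hypothetical and not part of the Tesseract package.

```csharp
using System.Collections.Generic;
using System.IO;

public static class TessdataCheck
{
    // Returns the language codes whose "<code>.traineddata" file
    // is not present in the given tessdata directory.
    public static List<string> MissingLanguages(string tessdataDir, params string[] languageCodes)
    {
        var missing = new List<string>();
        foreach (string code in languageCodes)
        {
            string file = Path.Combine(tessdataDir, code + ".traineddata");
            if (!File.Exists(file))
            {
                missing.Add(code);
            }
        }
        return missing;
    }
}
```

For example, TessdataCheck.MissingLanguages(@".\tessdata", "eng") returns an empty list when eng.traineddata is in place, and a list containing "eng" when the folder or the file is missing.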

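Now add the Tesseract namespace at the top of the file and write the following code inside the ConvertToText_Click function. The engine, the image, and the page all wrap native resources, so each is disposed after use; the engine also processes only one page at a time, so the page must be disposed before the next conversion.

using Tesseract;

private void ConvertToText_Click(object sender, EventArgs e)
{
    // load the English language data from the tessdata folder
    using (var engine = new TesseractEngine(@".\tessdata", "eng", EngineMode.Default))
    using (var img = Pix.LoadFromFile(ImagePath.Text))
    using (var page = engine.Process(img))
    {
        richTextBox1.Text = page.GetText();
    }
}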

Step # 2: Run the Project:

Press Ctrl + F5 to run the project and select the image file you want to convert. I have selected the same English-language image as before. Press the Convert to Text button to extract the text from the image; the extracted text appears in the rich text box.
Tesseract also supports images in other languages. However, we have to add a separate language file to the tessdata folder for each one.
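As a rough sketch of what that looks like in code: the second constructor argument is the Tesseract language code, and Tesseract accepts several codes joined with "+". The example assumes that chi_sim.traineddata has been copied into the tessdata folder next to eng.traineddata. The class and method names (MultiLanguageOcr, ReadText) are hypothetical; only TesseractEngine, Pix, and the page's GetText come from the Tesseract package.

```csharp
using Tesseract;

public static class MultiLanguageOcr
{
    // Reads text from an image using one or more Tesseract languages.
    // "languages" uses Tesseract language codes joined with '+',
    // for example "chi_sim" alone or "eng+chi_sim" for mixed text.
    public static string ReadText(string imagePath, string languages)
    {
        using (var engine = new TesseractEngine(@".\tessdata", languages, EngineMode.Default))
        using (var img = Pix.LoadFromFile(imagePath))
        using (var page = engine.Process(img))
        {
            return page.GetText();
        }
    }
}
```

Inside ConvertToText_Click this could be used as richTextBox1.Text = MultiLanguageOcr.ReadText(ImagePath.Text, "eng+chi_sim");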

Comparing the two examples, the IronOcr .NET library is the easier one to set up and to use.

Fair Comparison Between IronOcr and Tesseract:

Interoperability:

With Tesseract, we are working with a native C++ library most of the time. Its interoperability with .NET is limited, and its cross-platform and Azure compatibility are poor. It requires us to choose the bitness of our application, meaning that a given build can target either 32-bit or 64-bit, not both. The Visual C++ runtime is also required to run Tesseract.
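Since the native libraries have to match the bitness of the running process, it can help to check which one the application is actually running as. A minimal sketch using only the base class library follows. The class name is hypothetical, and the "x86"/"x64" folder names are an assumption about how native binaries are laid out next to the application, so verify them against your own output folder.

```csharp
using System;

public static class BitnessInfo
{
    // Name of the folder expected to hold native binaries
    // that match the bitness of the current process (assumed layout).
    public static string NativeFolderName()
    {
        return Environment.Is64BitProcess ? "x64" : "x86";
    }
}
```

Logging BitnessInfo.NativeFolderName() at startup makes it easier to diagnose a native library load failure caused by a 32-bit/64-bit mismatch.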

With IronOcr, the complete installation, including languages, is done through the NuGet Package Manager as shown earlier. We don't need to install a native exe or dll. Everything is handled by a single .NET component library.

Up to Date & Maintained:

At the time of writing, the latest builds of Tesseract 5 were not designed to compile on Windows. Using Tesseract 5 from C# for free requires manually modifying and compiling Leptonica and Tesseract for Windows. In addition, free C# API wrappers on GitHub may be years behind or incompatible.

We can run IronOcr on Windows, macOS, Linux, Azure, AWS, Lambda, Mono and Xamarin Mac with little or no configuration. There are no native binaries to manage, and it is compatible with both .NET Framework and .NET Core.

Why IronOcr?

We should use IronOcr to manage Tesseract for the following reasons.

  1. It works entirely within .NET.
  2. You do not have to install Tesseract on your machine.
  3. It lets you run the latest Tesseract engine.
  4. It is available for all .NET project types, such as:
     a. .NET Framework 4.5
     b. .NET Standard 2
     c. .NET Core 2 and 3, and .NET 5
  5. It has improved accuracy, speed, efficiency and performance over traditional Tesseract.
  6. It supports current technologies such as Xamarin, Mono, Azure and Docker.
  7. It manages the complex Tesseract dictionary system through NuGet packages.
  8. It supports PDF, multi-frame TIFF and all major image formats without any configuration.
  9. It can correct low-quality scanned documents or images to get the best result from Tesseract.
  10. It requires only a few lines of code to use in our application.

Summary:

IronOcr is an up-to-date, well-maintained character recognition library for .NET. It provides accuracy, speed, simplicity and usability. You can download this product from here. For a more in-depth and advanced study of IronOcr, please refer to this link.

Now, what are you waiting for? Get the license and start using it.

I hope that you have found this article understandable and well organized. If you have any questions, you can ask in the comment section below.