DEV Community

Mehr Muhammad Hamza

How to use Tesseract OCR in C# - Full Tutorial

Last updated: Aug 26, 2024

Optical Character Recognition (OCR) technology has revolutionized the way we interact with documents, images, and text data. By converting scanned images and PDFs into searchable and editable text, OCR opens up a world of possibilities for automation, data extraction, and text analysis. In this tutorial, we will walk you through using Tesseract OCR in C#, leveraging the power of IronOCR, a comprehensive .NET library that simplifies OCR processes. Whether you're working on Windows Forms, ASP.NET, or any other .NET framework, this guide will equip you with the knowledge to extract text from images quickly and efficiently.

Why Choose IronOCR for Tesseract OCR?

IronOCR is more than just a library; it's a robust solution that encapsulates the Tesseract OCR engine within a user-friendly .NET wrapper. By using IronOCR, you get access to the advanced capabilities of Tesseract, coupled with enhanced features like error correction, language support, and cross-platform compatibility. The library is designed for developers who want to integrate OCR functionality into their .NET applications with minimal effort and maximum flexibility.

Key Benefits of IronOCR:

  • Seamless Integration: Works across .NET versions including .NET Framework 4.5, .NET Standard 2.0, .NET Core 2 and 3, and .NET 5, 6, 7, and 8.
  • Enhanced Accuracy: Provides advanced image pre-processing to correct low-quality scans, ensuring better OCR results.
  • Extensive Language Support: Supports over 150 languages, making it ideal for global applications.
  • Speed and Efficiency: Optimized for high performance, enabling fast and accurate text extraction even from complex documents.

Setting Up Your Project

Begin by creating a new C# project in Visual Studio. You can choose any project type, such as a Console App, Windows Forms, or ASP.NET application. Once your project is set up, you'll need to install the IronOCR package via NuGet.

Step # 1: Open Visual Studio and Create Project

Open Visual Studio. I am using Visual Studio 2019, but you can use any version.
Select “Create New Project”, then select the Windows Forms application template.
Click “Next”. Name the project, select a location, and click “Next”.
Select the target framework. I have chosen .NET 5.0, but you can choose your preferred option. Click “Finish”, and the Windows Forms application will be created.
Before proceeding further, we need to install the NuGet package for IronOCR.

Step # 2: Install the IronOcr NuGet Package

Open the NuGet Package Manager Console from Tools > NuGet Package Manager > Package Manager Console.
Type “Install-Package IronOcr” in the console and press Enter.
IronOCR will begin installing in your project. After the installation is complete, open your Windows Form and design your application.

Step # 3: Designing Your Application Interface (Windows Forms Example)

For this tutorial, we'll create a simple Windows Forms application that allows users to select an image, perform OCR, and display the extracted text. Start by designing your form with the following controls:

  • Label: To display the title or instructions.
  • Buttons: One for selecting an image and another for converting the image to text.
  • TextBox: To display the selected image path.
  • PictureBox: To show the selected image.
  • RichTextBox: To display the extracted text.
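
If you prefer to build the form in code rather than in the designer, the controls listed above could be laid out roughly as follows. This is only a sketch: the control names pictureBox1, ImagePath, and richTextBox1 match the handlers used in this tutorial, while the class name OcrForm, the window title, the label text, and all sizes and positions are assumptions made for illustration.

```csharp
using System;
using System.Drawing;
using System.Windows.Forms;

// Hypothetical form class; the tutorial itself uses the designer-generated form.
public class OcrForm : Form
{
    // Field names mirror the ones referenced by the tutorial's click handlers.
    private readonly TextBox ImagePath = new TextBox();
    private readonly PictureBox pictureBox1 = new PictureBox();
    private readonly RichTextBox richTextBox1 = new RichTextBox();

    public OcrForm()
    {
        Text = "Image to Text";               // assumed window title
        ClientSize = new Size(640, 480);      // assumed window size

        // Label with instructions, plus the two buttons.
        var title = new Label { Text = "Select an image, then convert it to text", Location = new Point(12, 9), AutoSize = true };
        var selectImage = new Button { Text = "Select Image", Location = new Point(12, 35), Size = new Size(110, 25) };
        var convertToText = new Button { Text = "Convert to Text", Location = new Point(130, 35), Size = new Size(110, 25) };

        // TextBox that shows the selected image path.
        ImagePath.Location = new Point(250, 37);
        ImagePath.Size = new Size(378, 23);
        ImagePath.ReadOnly = true;

        // PictureBox that previews the selected image, scaled to fit.
        pictureBox1.Location = new Point(12, 70);
        pictureBox1.Size = new Size(300, 398);
        pictureBox1.SizeMode = PictureBoxSizeMode.Zoom;
        pictureBox1.BorderStyle = BorderStyle.FixedSingle;

        // RichTextBox that receives the extracted text.
        richTextBox1.Location = new Point(320, 70);
        richTextBox1.Size = new Size(308, 398);

        Controls.AddRange(new Control[] { title, selectImage, convertToText, ImagePath, pictureBox1, richTextBox1 });
    }

    [STAThread]
    public static void Main()
    {
        Application.EnableVisualStyles();
        Application.Run(new OcrForm());
    }
}
```

The buttons are not wired to anything here. Once the click handlers exist, they could be attached in the constructor with selectImage.Click += SelectImage_Click; and convertToText.Click += ConvertToText_Click;.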

Now that the interface is ready, let's write the code to handle image selection and OCR processing.

Step # 4: Writing the Code behind the Buttons

Double-click on the “Select Image” button.
The following empty handler will appear:

private void SelectImage_Click(object sender, EventArgs e)
{
}

Write the following code inside this function:

private void SelectImage_Click(object sender, EventArgs e)
{
    // Dispose the dialog when done; it wraps a native window handle.
    using (OpenFileDialog open = new OpenFileDialog())
    {
        // Limit the picker to common image formats.
        open.Filter = "Image Files(*.jpg; *.jpeg; *.gif; *.bmp)|*.jpg; *.jpeg; *.gif; *.bmp";
        if (open.ShowDialog() == DialogResult.OK)
        {
            // Release the previously shown image, then display the new one.
            pictureBox1.Image?.Dispose();
            pictureBox1.Image = new Bitmap(open.FileName);
            // Remember the file path for the OCR step.
            ImagePath.Text = open.FileName;
        }
    }
}

Next, double-click on the “Convert to Text” button and the following handler will appear:

private void ConvertToText_Click(object sender, EventArgs e)
{
}

Add the following namespace at the top of the file: using IronOcr;

Next, add the following code inside the ConvertToText_Click() function:

private void ConvertToText_Click(object sender, EventArgs e)
{
    // Create the OCR engine, read the selected image, and show the recognized text.
    IronTesseract ocr = new IronTesseract();
    var result = ocr.Read(ImagePath.Text);
    richTextBox1.Text = result.Text;
}

As you can see, we only needed to write three lines of code to perform this major task, all thanks to IronOcr.
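
The handler above passes ImagePath.Text straight to Read, so clicking “Convert to Text” before an image has been selected is likely to fail. A minimal sketch of a guard is shown below. The class OcrGuard and the method TryExtractText are hypothetical names introduced here, not part of IronOcr, and the OCR call itself is passed in as a delegate so the helper does not depend on any particular library.

```csharp
using System;
using System.IO;

// Hypothetical helper; not part of IronOcr or the Tesseract package.
public static class OcrGuard
{
    // Runs readText on the file only if the path points to an existing file.
    // Returns false and fills error instead of throwing when the file is
    // missing or the OCR call fails.
    public static bool TryExtractText(string path, Func<string, string> readText,
                                      out string text, out string error)
    {
        if (readText == null) throw new ArgumentNullException(nameof(readText));

        text = string.Empty;
        error = string.Empty;

        if (string.IsNullOrWhiteSpace(path) || !File.Exists(path))
        {
            error = "Select an existing image file first.";
            return false;
        }

        try
        {
            text = readText(path) ?? string.Empty;
            return true;
        }
        catch (Exception ex)
        {
            // Report the failure to the caller rather than crashing the UI.
            error = "OCR failed: " + ex.Message;
            return false;
        }
    }
}
```

Inside ConvertToText_Click, this could be called as OcrGuard.TryExtractText(ImagePath.Text, p => new IronTesseract().Read(p).Text, out var text, out var error), assigning text to richTextBox1.Text on success and showing error in a MessageBox otherwise.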

Step # 5: Run the Project

Let’s run the project. Press Ctrl + F5 to start it.
Click on the “Select Image” button and choose an image. I am selecting a snapshot of an article, but you can select any image you like.
Next, click the “Convert to Text” button to extract all the text from the image.
The text from the article image was extracted easily, and in my test the result was accurate. IronOcr has made this job incredibly easy.

Using IronOcr to Extract Text in Different Languages

One of the standout features of IronOCR is its support for over 150 languages. Whether you need to extract text in English, Chinese, Arabic, or any other language, IronOCR makes it straightforward.

Step # 1: Install the Nuget Package for the Specific Language

To extract text in a language other than English, you need to install the corresponding language package via NuGet. For example, to work with Chinese, use the following command:

Install-Package IronOcr.Languages.Chinese

Once the language package is installed, update your code to specify the language by setting the Language property to OcrLanguage.ChineseSimplified, such as:

private void ConvertToText_Click(object sender, EventArgs e)
{
    IronTesseract ocr = new IronTesseract();
    // Use the Simplified Chinese language pack installed above.
    ocr.Language = OcrLanguage.ChineseSimplified;
    var result = ocr.Read(ImagePath.Text);
    richTextBox1.Text = result.Text;
}

Let’s do the test again.

Step # 2: Run the Project

We can see that we have easily converted our Chinese-language image into text with just one additional line of code. The IronOcr .NET library provides accuracy, efficiency, and an easy way to add OCR to our .NET application.

How to Extract Text from the Image using Traditional Tesseract: A Step-by-Step Guide

Let’s look at the following example to see how we can achieve the same goal using the Tesseract package directly. We can keep the same Windows Form as in the previous example and just change the code in the ConvertToText_Click handler behind the “Convert to Text” button. Everything else will remain the same as before.

Step # 1: Install the Tesseract NuGet Package

Run the following command in the NuGet Package Manager Console:

Install-Package Tesseract

After installing the NuGet package, you must add the language files to the project folder manually. One could say that this is a drawback of this particular library. Download the language files, unzip the archive, and copy the tessdata folder into the debug folder of your project.
Next, write the following code inside the ConvertToText_Click function:

private void ConvertToText_Click(object sender, EventArgs e)
{
    // Requires "using Tesseract;" at the top of the file.
    // The engine, image, and page all wrap native resources, so dispose each one.
    using (var ocrEngine = new TesseractEngine(@".\tessdata", "eng", EngineMode.Default))
    using (var img = Pix.LoadFromFile(ImagePath.Text))
    using (var page = ocrEngine.Process(img))
    {
        richTextBox1.Text = page.GetText();
    }
}

Step # 2: Run the Project

Press Ctrl + F5 to run the project. Select the image file you want to convert. I have selected the same English-language file as in the previous example. Click the “Convert to Text” button to extract the text from the image. The extracted text appears in the text box.
Tesseract also supports images featuring different languages. However, we have to add separate language files into our project folder.
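
Because the language files are copied by hand, a missing file is an easy mistake to make. The sketch below checks the tessdata folder before the engine is created. It assumes the usual Tesseract convention that each language code maps to a file named <code>.traineddata (for example eng.traineddata, or chi_sim.traineddata for Simplified Chinese) and that several languages can be combined with “+”. The class TessdataCheck and the method FindMissingLanguageFiles are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper; not part of the Tesseract NuGet package.
public static class TessdataCheck
{
    // Returns the expected .traineddata paths that do not exist for a
    // language string such as "eng" or "eng+chi_sim".
    public static List<string> FindMissingLanguageFiles(string tessdataDir, string languages)
    {
        if (tessdataDir == null) throw new ArgumentNullException(nameof(tessdataDir));
        if (languages == null) throw new ArgumentNullException(nameof(languages));

        var missing = new List<string>();
        foreach (string code in languages.Split(new[] { '+' }, StringSplitOptions.RemoveEmptyEntries))
        {
            // Each language is expected as <tessdataDir>/<code>.traineddata.
            string file = Path.Combine(tessdataDir, code.Trim() + ".traineddata");
            if (!File.Exists(file))
            {
                missing.Add(file);
            }
        }
        return missing;
    }
}
```

Before calling new TesseractEngine(@".\tessdata", "eng", EngineMode.Default), the handler could call TessdataCheck.FindMissingLanguageFiles(@".\tessdata", "eng") and show the returned paths in a message if the list is not empty. The same check applies when the language argument is changed to another code.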
It is now clear that the IronOcr .NET library is easier to use and easier to understand.

Practical Use Cases for OCR in C#

IronOCR's versatility makes it a valuable tool in various industries and applications. Here are some common use cases:

  • Document Digitization: Convert scanned documents into searchable and editable text files.
  • Data Extraction: Automate the extraction of text data from forms, invoices, and receipts.
  • Archiving: Create digital archives of printed materials, making them easily searchable.
  • Accessibility: Improve accessibility by converting image-based text into machine-readable formats.
  • Translation: Extract text in different languages for translation or localization projects.

Speed, Efficiency, and Error Handling

IronOCR is designed to be fast and efficient, capable of processing large volumes of images quickly without compromising accuracy. This is particularly important for applications that require real-time OCR, such as document scanning and data entry automation.

Common Errors and Troubleshooting

While IronOCR is user-friendly, you might encounter some common errors during implementation. Here’s how to troubleshoot them:

  • Incorrect Language Setting: Ensure that the correct language package is installed and specified in the code.
  • Low-Quality Images: If the OCR results are inaccurate, consider pre-processing the image to enhance quality.
  • Missing Dependencies: Verify that all necessary packages are installed via NuGet, and ensure there are no missing files.
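
As one example of pre-processing for low-quality images, small or low-resolution text sometimes reads better after the image is enlarged. The sketch below does this with System.Drawing, which the Windows Forms project already references. It is a generic technique, not IronOCR's own pre-processing, and the class name ImagePrep and method name Upscale are hypothetical. Whether it improves results depends on the image.

```csharp
using System;
using System.Drawing;
using System.Drawing.Drawing2D;

// Hypothetical helper; a generic enlargement step, not an IronOcr API.
public static class ImagePrep
{
    // Returns a new bitmap enlarged by the given whole-number factor.
    // The caller owns the returned bitmap and should dispose it.
    public static Bitmap Upscale(Image source, int factor)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        if (factor < 1) throw new ArgumentOutOfRangeException(nameof(factor));

        var scaled = new Bitmap(source.Width * factor, source.Height * factor);
        using (Graphics g = Graphics.FromImage(scaled))
        {
            // Bicubic interpolation keeps character edges smoother than the default.
            g.InterpolationMode = InterpolationMode.HighQualityBicubic;
            g.DrawImage(source, 0, 0, scaled.Width, scaled.Height);
        }
        return scaled;
    }
}
```

To use it with the handlers in this tutorial, the enlarged bitmap could be saved to a temporary file with scaled.Save(tempPath, System.Drawing.Imaging.ImageFormat.Png), and that path passed to Read in place of ImagePath.Text.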

Advanced Features of IronOCR

Beyond basic OCR, IronOCR offers several advanced features:

  • Image Pre-processing: Automatically corrects skewed, noisy, or low-quality images for better OCR accuracy.
  • PDF and TIFF Support: Extract text from multi-page PDFs and TIFFs without additional configuration.
  • Multithreading: Speed up OCR processing by leveraging multiple CPU cores.
  • Machine Learning Integration: Advanced machine learning algorithms improve text recognition and error correction.

These features make IronOCR not only powerful but also flexible, catering to a wide range of OCR needs.
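
When many images need to be processed, the calling code can also spread the work across CPU cores itself. The sketch below uses Parallel.ForEach from the .NET base class library. It is caller-side parallelism, separate from whatever multithreading IronOCR performs internally, and the names BatchOcr and ReadAll are hypothetical. The OCR call is passed in as a delegate, so the sketch makes no assumption about which library is used.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical helper; not part of IronOcr or the Tesseract package.
public static class BatchOcr
{
    // Runs readText on every path in parallel and returns path -> text.
    // maxParallel must be a positive number, or -1 for no explicit limit.
    public static IDictionary<string, string> ReadAll(
        IEnumerable<string> imagePaths, Func<string, string> readText, int maxParallel)
    {
        if (imagePaths == null) throw new ArgumentNullException(nameof(imagePaths));
        if (readText == null) throw new ArgumentNullException(nameof(readText));

        // ConcurrentDictionary allows safe writes from several threads at once.
        var results = new ConcurrentDictionary<string, string>();
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxParallel };

        Parallel.ForEach(imagePaths, options, path =>
        {
            results[path] = readText(path);
        });

        return results;
    }
}
```

A call might look like BatchOcr.ReadAll(files, p => new IronTesseract().Read(p).Text, Environment.ProcessorCount). Creating a fresh engine per image, as this delegate does, avoids sharing state between threads at the cost of some overhead. Whether a single engine instance can safely be shared is a question for the library's documentation.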

Conclusion

IronOCR stands out as a top choice for developers integrating OCR in C# applications, offering a seamless experience with its easy integration, support for over 150 languages, and powerful features like image pre-processing and multithreading. Whether you're building simple or complex OCR solutions, IronOCR simplifies text extraction from images, catering to developers of all experience levels. Start your journey with IronOCR by downloading the library and exploring its extensive documentation. With regular updates and a free trial by Iron Software, you have everything you need to build robust, OCR-powered applications. Happy coding!

Top comments (4)

Tomasz Mętek • Edited

Title is about Tesseract, article not really...
Seems like clickbait.

Mohamed Elfar

nice content <3

Mehr Muhammad Hamza

Thank you

Sheldon Connor

How do I get the location of a word in a image using ironOcr?