Suresh Mohan for Syncfusion, Inc.

Originally published at syncfusion.com

Easiest Way to OCR Process PDF Documents in ASP.NET Core

Optical character recognition (OCR) is a technology used to convert scanned paper documents, in the form of PDF files or images, into searchable, editable data.

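Starting with version 18.1.0.42, the Syncfusion OCR processor library supports OCR processing of PDF documents and other scanned images on the .NET Core platform. In this blog, I am going to create an ASP.NET Core web application that performs OCR on a PDF document. The steps are:

  1. Create an ASP.NET Core web application.
  2. Install the necessary NuGet packages.
  3. Perform OCR processing on a PDF document.
  4. Publish the OCR application in Azure App Service.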

Create an ASP.NET Core web application

Follow these steps to create an ASP.NET Core web application in Visual Studio 2019:

  1. In Visual Studio 2019, go to File > New and then select Project.
  2. Select Create a new project.
  3. Select the ASP.NET Core Web Application template.
  4. Enter the Project name and then click Create. The Project template dialog will be displayed.

Create a new ASP.NET Core Web application dialog box

Install necessary NuGet packages to OCR process PDF documents

Follow these steps to install the Syncfusion.PDF.OCR.Net.Core NuGet package in the project:

  1. Right-click the project and select Manage NuGet Packages.
  2. Search for the Syncfusion.PDF.OCR.Net.Core package and install it.

Perform OCR processing on PDF document

Follow these steps to perform OCR processing on a PDF document in ASP.NET Core:

  1. Syncfusion’s OCR processor internally uses Tesseract libraries to perform OCR, so copy the tessdata and TesseractBinaries folders from the NuGet package folder to the project folder. The tessdata folder contains OCR language data, and TesseractBinaries contains the wrapper assemblies for Tesseract OCR. OCR language data for other languages can be downloaded from https://github.com/tesseract-ocr/tessdata.
  2. Set Copy to Output Directory to Copy if newer for the Data, tessdata, and TesseractBinaries folders. The Data folder holds the input PDF and the font file used by the code in the later steps.
  3. Add a new button in Index.cshtml.
     @using (Html.BeginForm("PerformOCR", "Home", FormMethod.Post))
     {
         <input type="submit" value="Perform OCR" class="btn" />
     }
  4. Include the following namespaces in HomeController.cs.
     using Microsoft.AspNetCore.Hosting;
     using Microsoft.AspNetCore.Mvc;
     using Syncfusion.OCRProcessor;
     using Syncfusion.Pdf.Graphics;
     using Syncfusion.Pdf.Parsing;
     using System.IO;
  5. Include the following code example in HomeController.cs to perform the OCR processing.
     public IActionResult PerformOCR()
     {
         // Folder that holds the Tesseract wrapper binaries for Windows.
         string binaries = Path.Combine(_hostingEnvironment.ContentRootPath, "TesseractBinaries", "Windows");

         // Initialize the OCR processor with the Tesseract binaries.
         OCRProcessor processor = new OCRProcessor(binaries);
         // Set the language for the OCR processor.
         processor.Settings.Language = Languages.English;

         // Open the font file read-only so that concurrent requests can share it.
         string path = Path.Combine(_hostingEnvironment.ContentRootPath, "Data", "times.ttf");
         FileStream fontStream = new FileStream(path, FileMode.Open, FileAccess.Read);

         // Create a TrueType font to support Unicode characters in the PDF.
         processor.UnicodeFont = new PdfTrueTypeFont(fontStream, 8);

         // Set the temporary folder used to save intermediate files.
         processor.Settings.TempFolder = Path.Combine(_hostingEnvironment.ContentRootPath, "Data");

         // Load the PDF document (read-only, for the same reason as the font).
         FileStream inputDocument = new FileStream(Path.Combine(_hostingEnvironment.ContentRootPath, "Data", "PDF_succinctly.pdf"), FileMode.Open, FileAccess.Read);
         PdfLoadedDocument loadedDocument = new PdfLoadedDocument(inputDocument);

         // Perform OCR with the language data in the tessdata folder.
         string tessdataPath = Path.Combine(_hostingEnvironment.ContentRootPath, "tessdata");
         processor.PerformOCR(loadedDocument, tessdataPath);

         // Save the PDF document to a stream and rewind it for the response.
         MemoryStream outputDocument = new MemoryStream();
         loadedDocument.Save(outputDocument);
         outputDocument.Position = 0;

         // Dispose the OCR processor, the PDF document, and the input streams.
         processor.Dispose();
         loadedDocument.Close(true);
         inputDocument.Dispose();
         fontStream.Dispose();

         // Download the PDF document in the browser.
         FileStreamResult fileStreamResult = new FileStreamResult(outputDocument, "application/pdf");
         fileStreamResult.FileDownloadName = "OCRed_PDF_document.pdf";

         return fileStreamResult;
     }
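
The PerformOCR action reads its folder paths from a _hostingEnvironment field that the listing does not declare. A minimal sketch of how that field might be supplied through constructor injection follows. It assumes ASP.NET Core 3.0 or later, where the IWebHostEnvironment interface is available (on 2.x the counterpart would be IHostingEnvironment). Only the field name comes from the listing; the rest is an assumption about how the controller is set up, not code from the original sample.

```csharp
using Microsoft.AspNetCore.Hosting;
using Microsoft.AspNetCore.Mvc;

public class HomeController : Controller
{
    // Assumed declaration: exposes ContentRootPath, the application's base folder.
    private readonly IWebHostEnvironment _hostingEnvironment;

    // ASP.NET Core's built-in dependency injection passes the environment in.
    public HomeController(IWebHostEnvironment hostingEnvironment)
    {
        _hostingEnvironment = hostingEnvironment;
    }

    // Renders the view that contains the Perform OCR button.
    public IActionResult Index()
    {
        return View();
    }

    // The PerformOCR action from the listing goes here.
}
```

In a project created from the template, HomeController already exists inside the project's namespace, so in practice only the field and the constructor parameter would need to be added to it.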

By executing this example, you will get the PDF document shown in the following image.

OCRed PDF document

Publish OCR application in Azure App Service

Follow these steps to publish the OCR application in Azure App Service:

  1. In Solution Explorer, right-click the project and choose Publish (or use the Build > Publish menu item).
  2. In the Pick a publish target dialog box, choose App Service, select Create New, and click Create Profile.
  3. In the Create App Service dialog box that appears, sign in with your Azure account (if necessary). The default app service settings then populate the fields. Click Create.
  4. The project properties Publish pane shows the site URL and other details. Click Publish to publish the application in Azure App Service. Visual Studio then deploys the app, and the web app loads in your browser.

After publishing the application, you can perform OCR processing by navigating to the site URL.

Perform OCR PDF Document
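
If OCR works locally but not after publishing, one thing worth checking is whether the Data, tessdata, and TesseractBinaries folders were actually copied to the deployed output. The following is a small sketch of such a check using only System.IO. The class name OcrFolderCheck and the method FindMissingFolders are hypothetical helpers introduced here for illustration; they are not part of the Syncfusion library, and the folder names are simply the ones used in the steps of this post.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class OcrFolderCheck
{
    // Folder names taken from the steps in this post; adjust them if yours differ.
    private static readonly string[] RequiredFolders =
    {
        "Data",
        "tessdata",
        Path.Combine("TesseractBinaries", "Windows")
    };

    // Returns the required folders that do not exist under contentRoot.
    public static List<string> FindMissingFolders(string contentRoot)
    {
        var missing = new List<string>();
        foreach (string folder in RequiredFolders)
        {
            if (!Directory.Exists(Path.Combine(contentRoot, folder)))
            {
                missing.Add(folder);
            }
        }
        return missing;
    }
}
```

One way to use it would be to call OcrFolderCheck.FindMissingFolders(_hostingEnvironment.ContentRootPath) at the start of PerformOCR and return a plain error message listing the missing folders when the result is not empty, which is easier to diagnose than an exception raised from inside the OCR call.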

Conclusion

In this blog post, we have learned to perform OCR processing on PDF documents in ASP.NET Core web applications and publish the applications in Azure App Service.

Take a moment to peruse our documentation, where you’ll find other options and features, all with accompanying code examples.

If you have any questions about these features, please let us know in the comments below. You can also contact us through our support forum, Direct-Trac, or feedback portal. We are happy to assist you!

The post Easiest Way to OCR Process PDF Documents in ASP.NET Core appeared first on Syncfusion Blogs.