In this era of information explosion, we harness vast amounts of data to train large language models, build databases, and filter out the information that matters. PDF documents are one of the primary data sources and hold a wealth of valuable information. For developers, extracting text from PDFs is the first step toward effective data extraction.
You may be wondering how to extract text from PDFs in C#. iTextSharp stands out as an effective solution for PDF text extraction. In this guide, we'll look at how to use iTextSharp for PDF text extraction in C#, covering everything from installation and project setup to code samples. We'll also introduce another robust C# library, ComPDFKit, and compare the two to help you make an informed decision.
How to Extract Text from PDF in C# Using ComPDFKit?
Download ComPDFKit C# Library for Text Extraction
First, download and install the ComPDFKit C# library from NuGet. Make sure your environment meets the system requirements listed below.
Platform: Windows
System Requirements: Windows 7, 8, 10, and 11 (32-bit, 64-bit)
Integrated Development Environment: Visual Studio 2017 or higher
Framework Requirements: .NET Framework 4.6.1 or higher
Create a New Windows Project and Apply the License
Follow the instructions in our documentation to create a new project. After that, contact our sales team to get a free trial license to initialize the ComPDFKit Conversion SDK. Then, apply the license as follows.
string resPath = "***";
string libPath = "***";
string license = "***";
CPDFConverter.InitLibrary(libPath);
CPDFConverter.InitResource(resPath);
CPDFConverter.LicenseVerify(license);
Extract Text from PDFs
To extract text from PDF documents in C# using ComPDFKit, use the following code sample.
string inputFilePath = "***";
string outputFolderPath = "***";
string outputFileName = "***";
CPDFConverterJsonText converter = CPDFConvertFactroy.CreateConverter(CPDFConvertType.CPDFConvertTypeJsonText, inputFilePath) as CPDFConverterJsonText;
CPDFConvertJsonOptions jsonOptions = new CPDFConvertJsonOptions();
jsonOptions.IsAllowOCR = false;
ConvertError error = ConvertError.ERR_UNKNOWN;
converter.Convert(outputFolderPath, ref outputFileName, jsonOptions, ref error);
Notice
• Disabling OCR (Optical Character Recognition) means that text inside images, such as tables embedded as pictures, cannot be extracted.
• When we use the CPDFConverterJsonText class to access the content streams of a PDF document, we often get fragmented data. For example, suppose we are trying to extract the sentence "This is a sample sentence." from a PDF document. We may end up retrieving it in separate pieces, such as "This" and "is a sample sentence.". This happens because text objects in PDFs are not always cleanly organized into words, sentences, or paragraphs. When OCR is disabled, the CPDFConverterJsonText class returns text objects exactly as they are defined in the PDF page content streams.
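If the fragmented output is a problem for your use case, one option is to stitch the pieces back together after extraction. The following is only a minimal sketch of that idea, not part of ComPDFKit: the TextFragment class, its X and Y coordinates, the yTolerance parameter, and the assumption that Y grows downward are all hypothetical. You would need to map the fields of the JSON output onto them yourself.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical shape for one extracted text object; this is NOT a ComPDFKit type.
public sealed class TextFragment
{
    public string Text { get; }
    public double X { get; }   // assumed: left edge of the fragment
    public double Y { get; }   // assumed: baseline of the fragment

    public TextFragment(string text, double x, double y)
    {
        Text = text;
        X = x;
        Y = y;
    }
}

public static class FragmentMerger
{
    // Groups fragments whose baselines are within yTolerance of the first
    // fragment of a line, then joins each line's fragments from left to right.
    public static List<string> MergeIntoLines(IEnumerable<TextFragment> fragments, double yTolerance = 2.0)
    {
        var lines = new List<List<TextFragment>>();

        // Assumption: Y grows downward, so ascending Y is top-to-bottom order.
        foreach (var fragment in fragments.OrderBy(f => f.Y))
        {
            var current = lines.Count > 0 ? lines[lines.Count - 1] : null;
            if (current != null && Math.Abs(current[0].Y - fragment.Y) <= yTolerance)
                current.Add(fragment);
            else
                lines.Add(new List<TextFragment> { fragment });
        }

        var result = new List<string>();
        foreach (var line in lines)
        {
            var sb = new StringBuilder();
            foreach (var fragment in line.OrderBy(f => f.X))
            {
                // Add one space between fragments unless whitespace is already there.
                bool needsSpace = sb.Length > 0
                    && !char.IsWhiteSpace(sb[sb.Length - 1])
                    && fragment.Text.Length > 0
                    && !char.IsWhiteSpace(fragment.Text[0]);
                if (needsSpace)
                    sb.Append(' ');
                sb.Append(fragment.Text);
            }
            result.Add(sb.ToString());
        }
        return result;
    }
}
```

For instance, passing a fragment "This" at x = 72 and a fragment "is a sample sentence." at x = 98 with nearly identical baselines yields the single line "This is a sample sentence.". A fixed tolerance is a simplification; documents with mixed font sizes or multiple columns would need a more careful grouping rule.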
How to Extract Text from PDFs Using iTextSharp?
The steps for using iTextSharp for text extraction are similar to those for ComPDFKit. Once you have installed the iTextSharp PDF library and created a project, follow the example below to extract text from PDF files using the iTextSharp C# library.
using System;
using System.IO;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

namespace PDFApp2
{
    class Program
    {
        static void Main(string[] args)
        {
            string filePath = @"C:\Users\buttw\OneDrive\Desktop\highlighted PDF.pdf";
            string outPath = @"C:\Users\buttw\OneDrive\Desktop\name.txt";
            int pagesToScan = 2;

            try
            {
                PdfReader reader = new PdfReader(filePath);
                try
                {
                    // Never read past the last page of the document.
                    int lastPage = Math.Min(pagesToScan, reader.NumberOfPages);

                    // Open the output file once, in append mode (UTF-8 by default).
                    using (StreamWriter file = new StreamWriter(outPath, true))
                    {
                        for (int page = 1; page <= lastPage; page++)
                        {
                            // Create a fresh strategy for each page.
                            ITextExtractionStrategy its = new LocationTextExtractionStrategy();
                            // The returned string is already Unicode, so no re-encoding is needed.
                            string strText = PdfTextExtractor.GetTextFromPage(reader, page, its);
                            foreach (string line in strText.Split('\n'))
                            {
                                file.WriteLine(line);
                            }
                        }
                    }
                }
                finally
                {
                    // Release the PDF file even if extraction fails.
                    reader.Close();
                }
            }
            catch (Exception ex)
            {
                Console.Write(ex);
            }
        }
    }
}
Text Extraction C# Libraries Comparison
iTextSharp is, without question, a powerful PDF text extraction library for C#. As an open-source library, it is a go-to solution for developers who want to integrate common PDF features into their apps. Compared with ComPDFKit, however, iTextSharp requires longer and more complex code, while ComPDFKit takes fewer than 10 lines of code to extract text from PDFs.
With iTextSharp, developers are expected to write custom code or logic to achieve the desired functionality, which requires in-depth knowledge of both the library and C#. The ComPDFKit C# library, by contrast, provides out-of-the-box features that make text extraction a convenient process for developers.
Therefore, on the premise that ComPDFKit and iTextSharp have similar accuracy in PDF text extraction, ComPDFKit is superior in terms of performance and code readability.