Chelsea Devereaux for MESCIUS inc.

Originally published at grapecity.com

How to Parse and Extract Content from PDF Documents in C# VB.NET

As the current health situation continues to keep people in their home offices, tools that support serious online collaboration are more essential than ever. Many of these tools already existed, but the Covid-19 crisis of the past two years has driven their widespread adoption.

The crisis has opened everyone's eyes to the true need for these tools. Regardless of the current situation, GrapeCity Documents for PDF v5 continues the tradition of improving text handling in PDF documents, alongside many other upgrades and feature enhancements. Specifically, it lets you parse and read text from a PDF using C# and modify text throughout a PDF document. New samples also help developers get up and running fast on editing PDF documents.

Starting with version 3.2 and continuing today, the logic for parsing, extracting, and reading text from a PDF keeps improving. It efficiently handles special cases such as text rendered multiple times to create bold or shadowed effects, so that such text appears only once in the output rather than being repeated.

The FindText method returns FoundPosition objects. Each one exposes an array of Quadrilateral structures through its Bounds property, because FindText can now find text that spans more than one line. A new property, ITextMap.Paragraphs, returns a collection of ITextParagraph objects associated with the ITextMap.

In this blog, you can expect to learn the following:

  • Parse and extract data from a PDF
  • Read and parse text from a PDF using C#
  • Save your extracted data to another PDF file
  • Parse, read and extract text from a PDF across multiple lines or paragraphs

Create your C# PDF Parsing Code with the ITextMap.Paragraphs Property

This example reads an existing multi-page PDF document and shows how to use ITextMap.Paragraphs to extract paragraphs from each page of a PDF document. The complete example and code are included in the updated sample explorer for GrapeCity Documents for PDF.

The code extracts the text paragraphs on each page, rendering each section in alternating colors (for clarity) in a new PDF document:

Figure 2: Extract Paragraphs from a PDF Sample

First, the code creates a new PDF document where the text paragraphs will be rendered and adds a note explaining the sample at the top of the first page:

const int margin = 36;
Color c1 = Color.PaleGreen;
Color c2 = Color.PaleGoldenrod;

GcPdfDocument doc = new GcPdfDocument();
var page = doc.NewPage();

var rc = Common.Util.AddNote(
    "Here we load an existing PDF (Wetlands) into a temporary GcPdfDocument, " +
    "and iterate over the pages of that document, printing all paragraphs found on the page. " +
    "We alternate the background color for the paragraphs so that the bounds between paragraphs are more clear. " +
    "The original PDF is appended to the generated document for reference.",
    page,
    new RectangleF(margin, margin, page.Size.Width - margin * 2, 0));

// Text format for captions.
// GCTEXT is presumably a namespace alias defined in the full sample
// (for example: using GCTEXT = GrapeCity.Documents.Text;).
var tf = new TextFormat()
{
    Font = GCTEXT.Font.FromFile(Path.Combine("Resources", "Fonts", "yumin.ttf")),
    FontSize = 14,
    ForeColor = Color.Blue
};
// Text format for the paragraphs:
var tfpar = new TextFormat()
{
    Font = StandardFonts.Times,
    FontSize = 12,
    BackColor = c1,
};
// Text layout to render the text:
var tl = page.Graphics.CreateTextLayout();
tl.MaxWidth = doc.PageSize.Width;
tl.MaxHeight = doc.PageSize.Height;
tl.MarginAll = rc.Left;
tl.MarginTop = rc.Bottom + 36;
// Text split options for widow/orphan control:
TextSplitOptions to = new TextSplitOptions(tl)
{
    MinLinesInFirstParagraph = 2,
    MinLinesInLastParagraph = 2,
    RestMarginTop = rc.Left,
};

Code Analysis of GcPdf Parsing/Reading PDF with C#

A new GcPdfDocument object, doc, is created, and its NewPage method generates a new page. The helper function AddNote then adds a note explaining the sample at the top of the first page.

Next, new separate TextFormat objects are created to format the captions and paragraphs, and a new TextLayout object is created to specify the page margins.

Finally, a new TextSplitOptions object is created to control widows and orphans when the text is split across pages.

Using the new ITextMap.Paragraphs property, the code required to perform this task is straightforward:

// Open an arbitrary PDF, load it into a temp document and get all page texts:
using (var fs = File.OpenRead(Path.Combine("Resources", "PDFs", "Wetlands.pdf")))
{
    var doc1 = new GcPdfDocument();
    doc1.Load(fs);

    for (int i = 0; i < doc1.Pages.Count; ++i)
    {
        tl.AppendLine(string.Format("Paragraphs from page {0} of the original PDF:", i + 1), tf);

        var pg = doc1.Pages[i];
        var pars = pg.GetTextMap().Paragraphs;
        foreach (var par in pars)
        {
            tl.AppendLine(par.GetText(), tfpar);
            tfpar.BackColor = tfpar.BackColor == c1 ? c2 : c1;
        }
    }

    tl.PerformLayout(true);
    while (true)
    {
        // 'rest' will accept the text that did not fit:
        var splitResult = tl.Split(to, out TextLayout rest);
        doc.Pages.Last.Graphics.DrawTextLayout(tl, PointF.Empty);
        if (splitResult != SplitResult.Split)
            break;
        tl = rest;
        doc.NewPage();
    }
    // Append the original document for reference:
    doc.MergeWithDocument(doc1, new MergeDocumentOptions());
}
// Done:
doc.Save(stream);

Perform a Code Analysis of GcPdf Parsing/Reading PDF with C#

First, the Wetlands.pdf document is opened and loaded into a temporary GcPdfDocument. The new ITextMap.Paragraphs API is then used to get the text paragraphs on each page and append them to the text layout for the new document. After each paragraph is appended, the BackColor of the tfpar TextFormat is toggled between the two colors, so that adjacent paragraphs are highlighted differently in the new document.

Then the text is laid out with TextLayout.PerformLayout and paginated with TextLayout.Split, with each page's portion drawn into the output document. The original PDF is then appended for reference using GcPdfDocument.MergeWithDocument.

The final result is saved using GcPdfDocument.Save.
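
To inspect what ITextMap.Paragraphs returns without building an output PDF, the extraction loop above can be reduced to a console dump. This is a minimal sketch, not part of the original sample: it uses only the calls shown in this post (GcPdfDocument.Load, GetTextMap, Paragraphs, GetText), assumes the GcPdf package is referenced and exposes the GrapeCity.Documents.Pdf namespace, and takes a hypothetical input path from the command line.

```csharp
using System;
using System.IO;
using GrapeCity.Documents.Pdf; // assumed namespace for GcPdfDocument

class ParagraphDump
{
    static void Main(string[] args)
    {
        // Hypothetical input: pass a path, or fall back to the sample's Wetlands.pdf.
        string path = args.Length > 0
            ? args[0]
            : Path.Combine("Resources", "PDFs", "Wetlands.pdf");

        // Keep the stream open for as long as the loaded document is in use.
        using (var fs = File.OpenRead(path))
        {
            var doc = new GcPdfDocument();
            doc.Load(fs);

            for (int i = 0; i < doc.Pages.Count; ++i)
            {
                Console.WriteLine("--- Page {0} ---", i + 1);

                // Number the paragraphs so their boundaries are easy to see.
                int n = 0;
                foreach (var par in doc.Pages[i].GetTextMap().Paragraphs)
                {
                    n++;
                    Console.WriteLine("[{0}] {1}", n, par.GetText());
                }
            }
        }
    }
}
```

Printing the paragraphs one per line is a quick way to check how a given PDF is segmented before deciding how to render or store the extracted text.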

Enhanced FindText Across Multiple Lines

Finalize Your C# PDF Parsing/Reading Code and Extract Data (Save)

The FindText method now supports finding text that spans multiple lines in a paragraph or crosses paragraph boundaries. To illustrate this, code similar to that in the FindText demo sample is added, which searches for longer text strings that span multiple lines and paragraphs. Here is the code, added immediately above the call to doc.Save(stream):

// First search: a sentence that spans multiple lines.
var findIt = doc.FindText(
    new FindTextParams(
        "Hundreds, if not thousands, of invertebrates that form the food of birds also rely on water for most, if not all, phases of their existence.",
        true, false),
    OutputRange.All);
foreach (var find in findIt)
    foreach (var ql in find.Bounds)
        doc.Pages[find.PageIndex].Graphics.FillPolygon(ql, Color.FromArgb(100, Color.OrangeRed));

// Second search: text that spans more than one paragraph.
var findIt2 = doc.FindText(
    new FindTextParams(
        "To lose any more of these vital areas is almost unthinkable. Wetlands enhance and protect water quality in lakes and streams where additional species spend their time and from which we draw our water.",
        true, false),
    OutputRange.All);
foreach (var find in findIt2)
    foreach (var ql in find.Bounds)
        doc.Pages[find.PageIndex].Graphics.FillPolygon(ql, Color.FromArgb(100, Color.OrangeRed));
// Done:
doc.Save(stream);

Parse/Read the Text Across Multiple Lines or Paragraphs with C# and GcPdf API

The code uses the FindText method to find two longer text strings: the first spans multiple lines, and the second spans multiple paragraphs. The FoundPosition.Bounds property returns an array of Quadrilateral structures, one for the bounds of the text on each successive line or section.

The code uses GcGraphics.FillPolygon to highlight the found text and fill the area of the found text with a semi-transparent orange-red color.
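
To see how a multi-line match maps to several quadrilaterals, it can help to print the size of each Bounds array instead of drawing it. The following is a rough sketch under stated assumptions, not code from the original sample: it reuses the FindText call shape shown above, assumes the namespaces GrapeCity.Documents.Pdf, GrapeCity.Documents.Pdf.TextMap, and GrapeCity.Documents.Common, and reads from a hypothetical input path.

```csharp
using System;
using System.IO;
using GrapeCity.Documents.Common;      // assumed location of OutputRange
using GrapeCity.Documents.Pdf;         // assumed location of GcPdfDocument
using GrapeCity.Documents.Pdf.TextMap; // assumed location of FindTextParams

class FindTextReport
{
    static void Main(string[] args)
    {
        // Hypothetical input: pass a path, or fall back to the sample's Wetlands.pdf.
        string path = args.Length > 0
            ? args[0]
            : Path.Combine("Resources", "PDFs", "Wetlands.pdf");

        // A phrase taken from the second search in this post.
        string phrase = "To lose any more of these vital areas is almost unthinkable.";

        using (var fs = File.OpenRead(path))
        {
            var doc = new GcPdfDocument();
            doc.Load(fs);

            // Same two boolean flags as the FindTextParams calls in this post.
            var hits = doc.FindText(new FindTextParams(phrase, true, false), OutputRange.All);

            foreach (var hit in hits)
            {
                // A match that wraps across lines is expected to report
                // more than one quadrilateral in its Bounds array.
                Console.WriteLine(
                    "Page {0}: {1} bounding quadrilateral(s)",
                    hit.PageIndex + 1,
                    hit.Bounds.Length);
            }
        }
    }
}
```

A report like this makes it easy to confirm that a search string was found as a single logical match before adding the highlighting step.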

I hope you have found this helpful, and please don't hesitate to contact us with any questions you may have related to these blogs!

Keep on coding!
