Andriy Andruhovski for Aspose.PDF

Extract the text data from PDF file using Aspose.PDF for .NET

When working with PDF (Portable Document Format) files, you sometimes need to extract their text.
Aspose.PDF for .NET provides several classes for this. This article covers three of them: TextAbsorber, TextFragmentAbsorber, and ParagraphAbsorber.

The easiest way to extract the data from a PDF is to use TextAbsorber with the default options.

TextAbsorber performs text extraction and exposes the result through its Text property, so all the text arrives in a single string.
To extract text from one page only, call the Accept method on that page of the Document object. The index is the number of the page to extract from; page numbering in Aspose.PDF starts at 1.
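
The two cases described above (whole document and single page) can be sketched as follows. This is a minimal sketch, not the article's original listing: the file names `input.pdf` and `all-pages.txt` are placeholders, and it assumes the Aspose.PDF NuGet package is referenced.

```csharp
using System;
using System.IO;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class ExtractAllText
{
    static void Main()
    {
        // "input.pdf" is a placeholder path, not a file from the article.
        using (Document pdfDocument = new Document("input.pdf"))
        {
            // Whole document: visit every page with one absorber.
            TextAbsorber documentAbsorber = new TextAbsorber();
            pdfDocument.Pages.Accept(documentAbsorber);
            File.WriteAllText("all-pages.txt", documentAbsorber.Text);

            // Single page: the Pages collection is 1-based,
            // so Pages[1] is the first page.
            TextAbsorber pageAbsorber = new TextAbsorber();
            pdfDocument.Pages[1].Accept(pageAbsorber);
            Console.WriteLine(pageAbsorber.Text);
        }
    }
}
```

A fresh absorber is used for each case so that the text of the single page is not mixed with the text of the whole document.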

Sometimes we need to extract the text from a particular area of the page (e.g. the upper-left corner). TextAbsorber can do this too; we only need to configure its TextSearchOptions property. In the following example, we set two of its properties: LimitToPageBounds and Rectangle. Rectangle takes a Rectangle object as its value and specifies the region of the page from which to extract the text. In our example, LimitToPageBounds indicates that text is searched only within the page bounds, and Rectangle covers the upper half of the page.
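
A minimal sketch of this region-based extraction is shown below. The way the upper half is computed from `page.Rect` is an assumption of this sketch (it ignores page rotation), and `input.pdf` is a placeholder path. Note that PDF coordinates start at the lower-left corner, so the upper half is the part with the larger Y values.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class ExtractFromRegion
{
    static void Main()
    {
        // "input.pdf" is a placeholder path.
        using (Document pdfDocument = new Document("input.pdf"))
        {
            Page page = pdfDocument.Pages[1];

            // Build a rectangle covering the upper half of the page.
            // Arguments are lower-left X/Y and upper-right X/Y.
            Aspose.Pdf.Rectangle pageRect = page.Rect;
            double middleY = (pageRect.LLY + pageRect.URY) / 2;
            Aspose.Pdf.Rectangle upperHalf = new Aspose.Pdf.Rectangle(
                pageRect.LLX, middleY, pageRect.URX, pageRect.URY);

            TextAbsorber absorber = new TextAbsorber();
            // Search only within the page bounds ...
            absorber.TextSearchOptions.LimitToPageBounds = true;
            // ... and only inside the chosen region.
            absorber.TextSearchOptions.Rectangle = upperHalf;

            page.Accept(absorber);
            Console.WriteLine(absorber.Text);
        }
    }
}
```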

The TextFragmentAbsorber object is mainly used in text search scenarios. When the search completes, the occurrences are available as a collection of text fragments. Each TextFragment object provides access to the text of the occurrence and its text properties, and lets you edit the text and change the text state (font, font size, color, etc.).
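
As an illustration, here is a minimal sketch of a phrase search that prints each occurrence with its page, position, and font. The search phrase `"invoice"` and the path `input.pdf` are placeholders chosen for this sketch.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class SearchForPhrase
{
    static void Main()
    {
        // "input.pdf" and the phrase "invoice" are placeholders.
        using (Document pdfDocument = new Document("input.pdf"))
        {
            TextFragmentAbsorber absorber = new TextFragmentAbsorber("invoice");
            pdfDocument.Pages.Accept(absorber);

            // Each occurrence of the phrase is one TextFragment.
            foreach (TextFragment fragment in absorber.TextFragments)
            {
                Console.WriteLine("Text: {0}", fragment.Text);
                Console.WriteLine("Page: {0}", fragment.Page.Number);
                Console.WriteLine("Position: ({0}, {1})",
                    fragment.Position.XIndent, fragment.Position.YIndent);
                Console.WriteLine("Font: {0}, size {1}",
                    fragment.TextState.Font.FontName,
                    fragment.TextState.FontSize);
            }
        }
    }
}
```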

The ParagraphAbsorber class searches for sections and paragraphs of text and provides access to the rectangles and polygons that describe them in text coordinate space.
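
A minimal sketch of walking the pages, sections, and paragraphs found by ParagraphAbsorber might look like this. The path `input.pdf` is a placeholder, and joining the fragments of each line into a string is a choice made for this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class ExtractParagraphs
{
    static void Main()
    {
        // "input.pdf" is a placeholder path.
        using (Document pdfDocument = new Document("input.pdf"))
        {
            ParagraphAbsorber absorber = new ParagraphAbsorber();
            absorber.Visit(pdfDocument);

            foreach (PageMarkup markup in absorber.PageMarkups)
            {
                int sectionIndex = 1;
                foreach (MarkupSection section in markup.Sections)
                {
                    // The rectangle that bounds the whole section.
                    Aspose.Pdf.Rectangle bounds = section.Rectangle;
                    Console.WriteLine("Page {0}, section {1}: ({2}, {3}) - ({4}, {5})",
                        markup.Number, sectionIndex,
                        bounds.LLX, bounds.LLY, bounds.URX, bounds.URY);

                    foreach (MarkupParagraph paragraph in section.Paragraphs)
                    {
                        // A paragraph is a list of lines;
                        // each line is a list of text fragments.
                        StringBuilder text = new StringBuilder();
                        foreach (List<TextFragment> line in paragraph.Lines)
                        {
                            foreach (TextFragment fragment in line)
                            {
                                text.Append(fragment.Text);
                            }
                            text.AppendLine();
                        }
                        Console.WriteLine(text.ToString());
                    }
                    sectionIndex++;
                }
            }
        }
    }
}
```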

Top comments

mamtacd

Hi Team,
I need to extract content from a PDF by giving a paragraph heading or some phrase.
ParagraphAbsorber does get all the text. However, I need only the text of a particular paragraph, or a particular portion of a paragraph, not the complete page.
How can I achieve this?
Regards,
Mamtha.A.C.D.

Andriy Andruhovski

Thanks for your interest!
Currently, you can use TextFragmentAbsorber with a regular expression as an input parameter.

    // Create TextFragmentAbsorber object that searches all words starting 'h' 
    // and ending 'o' using regular expression.
    TextFragmentAbsorber absorber = new TextFragmentAbsorber(@"h\w*?o", 
         new TextSearchOptions(true));

Unfortunately, ParagraphAbsorber doesn't support searching by regular expression, so you need to analyze the paragraphs extracted with this tool manually.
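
One way to do that manual analysis is to rebuild the text of each paragraph and keep only those that match a pattern. This is a rough sketch under assumptions: `FindParagraphs` is a hypothetical helper written for this example (it is not part of Aspose.PDF), and the pattern `^Introduction` and the path `input.pdf` are placeholders.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class FilterParagraphs
{
    // Hypothetical helper (not an Aspose.PDF API): returns the text of
    // every paragraph whose content matches the given pattern.
    static List<string> FindParagraphs(Document document, string pattern)
    {
        List<string> matches = new List<string>();
        ParagraphAbsorber absorber = new ParagraphAbsorber();
        absorber.Visit(document);

        foreach (PageMarkup markup in absorber.PageMarkups)
        foreach (MarkupSection section in markup.Sections)
        foreach (MarkupParagraph paragraph in section.Paragraphs)
        {
            // Rebuild the paragraph text line by line.
            StringBuilder text = new StringBuilder();
            foreach (List<TextFragment> line in paragraph.Lines)
            {
                foreach (TextFragment fragment in line)
                {
                    text.Append(fragment.Text);
                }
                text.AppendLine();
            }

            // Keep the paragraph only if it matches the pattern.
            string paragraphText = text.ToString();
            if (Regex.IsMatch(paragraphText, pattern))
            {
                matches.Add(paragraphText);
            }
        }
        return matches;
    }

    static void Main()
    {
        // "input.pdf" and the pattern are placeholders.
        using (Document pdfDocument = new Document("input.pdf"))
        {
            foreach (string paragraph in FindParagraphs(pdfDocument, @"^Introduction"))
            {
                Console.WriteLine(paragraph);
            }
        }
    }
}
```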