Ever felt like you're wrestling with a stubborn PDF, desperately trying to extract that crucial information? We've all been there. PDFs are fantastic for document portability, but their structure can often make it a nightmare to get the data you need.
Imagine this: you've just received a massive repository of PDF documents from a client, filled with valuable data that needs to be analysed and utilized. You're eager to dive in, but there's a catch: half of these PDFs are scanned, and extracting meaningful information from them seems like a Herculean task.
Sound familiar?
Whether you're in finance, legal, healthcare, or any data-intensive field, the challenge of extracting data from PDFs is a common hurdle. Traditional methods often fall short, leaving you frustrated with manual data entry and limited extraction tools that can't handle the complexities of scanned documents.
The Struggle with PDF Extraction
You might have perfect digital PDFs where text extraction is straightforward. But then you hit a wall with scanned PDFs, where the text is essentially locked in images. This is not just inconvenient but also incredibly time-consuming and prone to errors. The quality of data suffers, and your efficiency takes a hit.
The Game-Changer: Adobe Extract API Service
In our quest for an efficient solution, we explored numerous services and tools, such as AWS Textract, which automatically extracts text and data from scanned documents, and Unstructured, which provides open-source solutions for processing and analysing unstructured data, each promising to simplify PDF data extraction. However, the real breakthrough came when we discovered the Adobe Extract API service. This service didn't just meet our expectations; it exceeded them.
The Adobe Extract API service is designed to handle the complexities of both digital and scanned PDFs with remarkable accuracy. It seamlessly ingests PDFs and extracts text, tables, and images, turning even the most stubbornly scanned documents into actionable data. Additionally, Adobe Extract is well-suited for handling multi-column layouts, such as those found in newsletters. The Adobe Extract API service provided a reliable, efficient solution that saved us countless hours and significantly improved our data quality.
No more delays, let's get started!
What exactly is the Adobe Extract API?
Think of it as your personal PDF sherpa, guiding you through the maze of document structures. It's a cloud-based web service powered by Adobe Sensei, Adobe's industry-leading AI and Machine Learning platform. Although the Adobe PDF Extract API itself is not open source, it provides open-source SDKs for various programming languages, including Java and Python, which can be used to integrate the API into applications.
Here's the magic: Adobe Sensei AI dives deep into each page, deciphering layouts, structures, and even nuances in text and images. It doesn't just stop at simple text extraction; it's capable of tackling complex tables, figures, and more, all while maintaining accuracy and precision.
The best part? Once the Adobe Extract API works its magic, it transforms everything into a neatly organised JSON format, ready for you to dive into. It works well with both native and scanned documents.
Imagine a world where extracting customer information from invoices, product details from brochures, or financial data from reports becomes a breeze. That's the power the Adobe Extract API puts in your hands.
Please refer to Adobe's official GitHub repository for the full code implementation. This sample code provides a valuable starting point:
// Choose which element types to extract and which renditions to export.
ExtractPDFParams extractPDFParams = ExtractPDFParams.extractPDFParamsBuilder()
        .addElementsToExtract(Arrays.asList(ExtractElementType.TEXT, ExtractElementType.TABLES))
        .addElementsToExtractRenditions(Arrays.asList(ExtractRenditionsElementType.TABLES, ExtractRenditionsElementType.FIGURES))
        .build();
This section of the code is where you specify which elements you want to extract from the PDF document. It's like setting up a blueprint for the extraction process. Here, addElementsToExtract selects the content types to extract (text and tables in this example), while addElementsToExtractRenditions requests rendition files for tables and figures. By configuring these parameters, you ensure that the Adobe Extract API focuses on the specific elements you're interested in, making the extraction process more targeted and efficient.
This is the sample PDF link that you can use to compare the below outputs:
- It extracts the text, tables and figures within the PDF.
- Finally, it converts the extracted tables into a CSV format, making the data easier to work with.
1. Extended metadata:
"extended_metadata": {
"ID_instance": "11 B0 4E 31 FA B9 B2 11 0A 00 67 45 8B 6B C6 23 ",
"ID_permanent": "45 46 20 44 30 20 34 42 20 33 31 20 46 41 20 42 39 20 42 32 20 31 31 20 30 41 20 30 30 20 36 37 20 34 35 20 38 42 20 36 42 20 43 36 20 32 33 20 ",
"has_acroform": false,
"has_embedded_files": false,
"is_XFA": false,
"is_certified": false,
"is_encrypted": false,
"is_digitally_signed": false,
"language": "en",
"page_count": 4,
"pdf_version": "1.6",
"pdfa_compliance_level": "",
"pdfua_compliance_level": ""
}
The provided JSON object contains metadata about the PDF file, including a unique instance identifier, a permanent identifier, and flags indicating various properties such as the presence of interactive forms, embedded files, XML Forms Architecture, certification, encryption, digital signatures, and compliance with PDF/A and PDF/UA standards. The metadata also specifies the language used in the PDF, the number of pages, and the version of the PDF standard employed. This detailed information helps in understanding the structure, security, and compatibility of the PDF file.
2. Elements
Section: Text
{
"Bounds": [
57.635894775390625, 460.58860778808594, 123.0447998046875,
480.5798645019531
],
"Font": {
"alt_family_name": "Arial",
"embedded": true,
"encoding": "Custom",
"family_name": "Arial",
"font_type": "TrueType",
"italic": false,
"monospaced": false,
"name": "HOEPNL+Arial,Bold",
"subset": true,
"weight": 700
},
"HasClip": false,
"Lang": "en",
"ObjectID": 312,
"Page": 0,
"Path": "//Document/Sect[2]/H1",
"Text": "Overview ",
"TextSize": 14.038803100585938,
"attributes": { "LineHeight": 16.875 }
},
{
"Bounds": [
57.635894775390625, 430.2440490722656, 522.2810974121094,
457.99635314941406
],
"Font": {
"alt_family_name": "Arial",
"embedded": true,
"encoding": "Custom",
"family_name": "Arial",
"font_type": "TrueType",
"italic": false,
"monospaced": false,
"name": "HOEPAP+Arial",
"subset": true,
"weight": 400
},
"HasClip": false,
"Lang": "en",
"ObjectID": 313,
"Page": 0,
"Path": "//Document/Sect[2]/P",
"Text": "This sample consists of a simple form containing four distinct fields. The data file contains eight separate records. ",
"TextSize": 11.039093017578125
}
The provided JSON object includes details such as the element's spatial coordinates, font characteristics, and other attributes that help identify and style the element. The object contains the actual text value of the element, which provides a clear understanding of the content within that section.
The Path property in the provided JSON object is an XPath-style expression that identifies the location of the text element within the PDF document structure. Specifically, for //Document/Sect[2]/H1:
- //Document matches the root Document element.
- /Sect[2] matches the second Sect (section) element under the Document element.
- /H1 matches the H1 (heading 1) element under the second section.
So this expression selects the H1 heading element that is the child of the second Sect element in the PDF document. This allows the text element to be precisely located and referenced within the document's hierarchy.
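If you want to work with these paths in code, here is a minimal sketch of splitting a Path value into its steps. The ElementPath class and its methods are hypothetical helpers, not part of the Adobe SDK, and the sketch assumes that a step without a bracketed position (such as Sect) refers to the first element of that name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; it is not part of the Adobe SDK.
final class ElementPath {

    /** One step of a Path, e.g. name "Sect" with index 2. */
    static final class Step {
        final String name;
        final int index; // 1-based; assumed to be 1 when the step has no [n] suffix

        Step(String name, int index) {
            this.name = name;
            this.index = index;
        }
    }

    // A tag name optionally followed by a bracketed position, e.g. "Sect[2]".
    private static final Pattern STEP =
            Pattern.compile("([A-Za-z][A-Za-z0-9]*)(?:\\[(\\d+)\\])?");

    /** Splits a Path such as "//Document/Sect[2]/H1" into its steps. */
    static List<Step> parse(String path) {
        List<Step> steps = new ArrayList<>();
        for (String part : path.split("/")) {
            if (part.isEmpty()) {
                continue; // the leading "//" produces empty pieces
            }
            Matcher m = STEP.matcher(part);
            if (!m.matches()) {
                throw new IllegalArgumentException("Unexpected path step: " + part);
            }
            int index = (m.group(2) == null) ? 1 : Integer.parseInt(m.group(2));
            steps.add(new Step(m.group(1), index));
        }
        return steps;
    }

    /** True when the last step is a heading tag such as H1 or H2. */
    static boolean isHeading(String path) {
        List<Step> steps = parse(path);
        return !steps.isEmpty() && steps.get(steps.size() - 1).name.matches("H[1-6]");
    }

    public static void main(String[] args) {
        for (Step step : parse("//Document/Sect[2]/H1")) {
            System.out.println(step.name + " #" + step.index);
        }
    }
}
```

A check like isHeading can be handy when you want to group paragraphs under the heading that precedes them.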
Section: Table/Figures
{
"Bounds": [
63.39500427246094, 499.7163848876953, 433.54026794433594,
629.3433837890625
],
"ObjectID": 386,
"Page": 0,
"Path": "//Document/Sect/Table",
"attributes": {
"BBox": [
56.757699999998295, 496.57199999998556, 514.8989999999758,
635.045999999973
],
"NumCol": 2,
"NumRow": 4,
"Placement": "Block",
"SpaceAfter": 18
},
"filePaths": ["tables/fileoutpart0.csv", "tables/fileoutpart1.png"]
}
The provided JSON object includes attributes that describe the table's structure and layout, along with the file paths of the table's renditions, which are easy to map back to the corresponding section data:
- tables/fileoutpart0.csv
- tables/fileoutpart1.png
These file paths point to the extracted data in .csv files and the rendered images in .png files, which can be used to enhance or extend the content of elements within the document.
Thus, the API captures the natural reading order of the extracted elements and their layout on each page. This helps you understand the overall context of the extracted data.
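As a small illustration of mapping filePaths back to usable files, the sketch below separates the entries by extension. RenditionPaths and withExtension are hypothetical names, and the sketch assumes you have already read the filePaths array from the JSON into a list of strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; it is not part of the Adobe SDK.
final class RenditionPaths {

    /** Returns the entries whose file name ends with the given extension (case-insensitive). */
    static List<String> withExtension(List<String> filePaths, String extension) {
        String suffix = "." + extension.toLowerCase(Locale.ROOT);
        List<String> matches = new ArrayList<>();
        for (String path : filePaths) {
            if (path.toLowerCase(Locale.ROOT).endsWith(suffix)) {
                matches.add(path);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Values taken from the table element shown above.
        List<String> filePaths = List.of("tables/fileoutpart0.csv", "tables/fileoutpart1.png");
        System.out.println("CSV files: " + withExtension(filePaths, "csv"));
        System.out.println("PNG files: " + withExtension(filePaths, "png"));
    }
}
```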
3. Pages
{
"boxes": {
"CropBox": [0.0, 0.0, 612.0, 792.0],
"MediaBox": [0.0, 0.0, 612.0, 792.0]
},
"height": 792.0,
"is_scanned": false,
"page_number": 0,
"rotation": 0,
"width": 612.0
}
This object indicates whether the page was scanned or not, its page number within the document, and its rotation angle.
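The page height is also useful if you want to draw an element's Bounds on a rendered page image. Here is a rough sketch of that conversion. It assumes Bounds is ordered as [left, bottom, right, top] in PDF points with the origin at the bottom-left corner of the page, which matches the sample values above but is worth confirming against Adobe's documentation. BoundsConverter is a hypothetical helper, not an SDK class.

```java
// Hypothetical helper for illustration; it is not part of the Adobe SDK.
final class BoundsConverter {

    /**
     * Converts bounds given as [left, bottom, right, top] with a bottom-left origin
     * (an assumption about the Bounds layout) into {x, y, width, height}
     * measured from the top-left corner of the page.
     */
    static double[] toTopLeftRect(double[] bounds, double pageHeight) {
        double left = bounds[0];
        double bottom = bounds[1];
        double right = bounds[2];
        double top = bounds[3];
        return new double[] { left, pageHeight - top, right - left, top - bottom };
    }

    public static void main(String[] args) {
        // Bounds of the "Overview" heading and the page height from the samples above.
        double[] bounds = {
            57.635894775390625, 460.58860778808594, 123.0447998046875, 480.5798645019531
        };
        double[] rect = toTopLeftRect(bounds, 792.0);
        System.out.printf("x=%.2f y=%.2f width=%.2f height=%.2f%n",
                rect[0], rect[1], rect[2], rect[3]);
    }
}
```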
PDF Processing 101: Understanding the Limits!!
Adobe's API has some limitations when it comes to processing PDF files. Here are a few:
- File Size: Files up to 100 MB are supported, so you can keep your files lean and mean.
- Number of Pages: Non-scanned PDFs can handle up to 400 pages, while scanned PDFs are limited to 150 pages or less.
- For files that are bigger than a house or have a crazy layout, it's best to break them up into smaller chunks before processing.
If your PDF is a bit on the heavy side, don't worry: Adobe's got your back. They offer a way to delete pages, so you can give your file a makeover and make it fit for processing.
Don't forget that deleting pages also costs you time and money!!
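To make the limits above concrete, here is a minimal sketch that checks a file against them and works out page ranges for splitting an oversized document. ExtractLimits is a hypothetical helper, it treats 100 MB as 100 * 1024 * 1024 bytes (an assumption), and it only plans the ranges; the actual splitting or page deletion has to be done with a separate tool or operation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; it is not part of the Adobe SDK.
final class ExtractLimits {

    static final long MAX_BYTES = 100L * 1024 * 1024; // 100 MB, assuming binary megabytes
    static final int MAX_PAGES_NATIVE = 400;          // non-scanned PDFs
    static final int MAX_PAGES_SCANNED = 150;         // scanned PDFs

    static int maxPages(boolean scanned) {
        return scanned ? MAX_PAGES_SCANNED : MAX_PAGES_NATIVE;
    }

    /** True when both the file size and the page count are within the limits listed above. */
    static boolean withinLimits(long sizeBytes, int pageCount, boolean scanned) {
        return sizeBytes <= MAX_BYTES && pageCount <= maxPages(scanned);
    }

    /** Returns 1-based, inclusive {start, end} page ranges that each fit the page limit. */
    static List<int[]> pageRanges(int pageCount, boolean scanned) {
        int chunk = maxPages(scanned);
        List<int[]> ranges = new ArrayList<>();
        for (int start = 1; start <= pageCount; start += chunk) {
            ranges.add(new int[] { start, Math.min(start + chunk - 1, pageCount) });
        }
        return ranges;
    }

    public static void main(String[] args) {
        int pageCount = 1000;
        boolean scanned = false;
        if (!withinLimits(20L * 1024 * 1024, pageCount, scanned)) {
            for (int[] range : pageRanges(pageCount, scanned)) {
                System.out.println("Process pages " + range[0] + " to " + range[1]);
            }
        }
    }
}
```

Note that this only splits by page count. A chunk can still exceed the size limit if its pages are heavy, so it is worth re-checking the size of each piece.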
Top comments (5)
Good article with needed information!
I understand that the article describes a process where uploading a PDF containing images, text, etc., will extract the data. This includes saving images as PNGs, converting tables to CSV files, and converting text data into JSON.
Then how is this data used for Retrieval-Augmented Generation (RAG)? For RAG applications, the data needs to be stored in a vector database.
How would we pass this data to the vector database?
Great question!
This is the first step: extracting data from any kind of PDF.
There are a few steps involved before ingesting into a vector DB. You'll need to perform preprocessing steps such as section-based text appending, which helps you combine meaningful sections, and pulling in the table data stored in the CSV files. For instance, if your use case doesn't require certain sections of the PDF, you can omit them during preprocessing before the data goes to the vector DB.
Once you are done with this, you are ready to ingest the data into the vector DB according to your chunking strategy and use it for your RAG application.
Let me know if you have any further questions.
Oh, okay, great! Are there any libraries or tools to help with this process?
No, we used custom code for preprocessing.
Great article. We were looking for a use case for our JSON comparison tool, but the PDF to JSON conversion from the Extract API for our sample documents (legal docs and invoices) produced quite messy results. One of the key constraints in the output was the use of bounding boxes for sections of content. Perhaps we are missing something? Do Adobe authoring tools allow for marked up content to avoid placing within a page layout structure?