Matthew Collinge
Extract table data from Documents using Azure AI Document Intelligence

Using the Azure AI Document Intelligence Layout model with Power Automate

Azure AI Document Intelligence is a cloud-based service that allows you to build intelligent document processing solutions. It uses machine learning to extract text, tables, and key-value pairs from documents, providing structured data for easy use.

In this post, I’m going to show how to send a PDF document to Azure AI Document Intelligence via a REST API request and return the text and table information using the Layout model. I’ll then go through some of my process for transforming the returned JSON output into an array that can be used to create rows in a Dataverse table, all within Power Automate.

Power Automate isn’t Power Query, so I’ll use some techniques to replicate some of Power Query’s filtering, cleansing, and sorting capabilities. Some of the filtering can be quite specific to the documents you have, so I’ll start quite general and then cover some techniques for cleansing the data by type along the way.

Choosing a Model

  • I would highly recommend using the Azure AI Document Intelligence Studio to find the most suitable model for your documents. You can try out models in the Studio with your own documents. I did this and was initially unsure how to actually use the model afterwards: the Studio provides code examples, but you are largely left to it from then on. Once I’d found the REST API documentation I felt much more comfortable, as I could use it with Power Automate!

  • Test multiple documents, as you may discover inconsistencies in the results. To get around these, you can adjust some of the actions and expressions in your Flow to deal with them.

My Example

Example PDF Preview
I have created a simple document that includes one table (and yes, I like sausages), but in the real world I have dealt with documents of all shapes and sizes. For example, where a table starts on a different page, we can append &pages= to the request URI to extract a specific page 🙂.
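
For instance, a request URI targeting a specific page might look like this (a minimal sketch; the page number is just an illustration):

{Your Azure AI Service Endpoint}/formrecognizer/documentModels/prebuilt-layout:analyze?api-version=2023-07-31&pages=2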

Passing the Document to Azure AI Document Intelligence

I have to start by mentioning that I could not have done this first part without the help of this YouTube video from the excellent Damien Bird: Power Automate Invoice Processing Tutorial AI Builder and Azure. In it, Damien explores the AI Builder option and compares the costs, introduces the Document Intelligence service on Azure, polls the service, reviews the response, and selects data from it.

We’re basically going to do the same, but select the Layout model (API version 2023-07-31) from Azure AI Document Intelligence.

Azure Services

Before we start building our Flow, we will need some Azure Services set up:

  • Resource Group
  • Azure AI Services > Document Intelligence (select the F0 pricing tier for up to 500 free pages per month, much better than AI Builder).
  • Azure Key Vault Secret

Make sure to copy the Endpoint, as we will need it in our Flow. To keep things secure, keep your key inside Azure Key Vault. I can recommend following this video from Sean Astrakhan on how to do this: Power Automate + Azure Key Vault - get key and secret and pass to Flow.

Flow Structure

Ok, let’s get on with the Flow!
Flow Overview

Manual Trigger

Start with a manual trigger with a File Content Input. This will be used to bring the PDF Document into the Flow.

and then set up some variables:

Initialize variable – Status – We will use this String variable to store the status returned by the polling HTTP request.

Initialize variable – Metadata – We will use this Object variable to contextualise and reference the document that we have extracted from.

Initialize variable – ParsedCellsArray – We will use this Array variable to loop through the table cells.
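
For reference, this is roughly how I set the three variables up (the initial values are just empty placeholders):

Status – String, initial value ""
Metadata – Object, initial value {}
ParsedCellsArray – Array, initial value []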

Azure Key Vault Get secret

Enter the name of your Azure Key Vault secret to return its value. In settings, enable ‘Secure inputs’ and ‘Secure outputs’ to ensure the secret is hidden.

HTTP Request

This is where the fun starts. Add an HTTP Action.

URL: {Your Azure AI Service Endpoint}/formrecognizer/documentModels/prebuilt-layout:analyze?api-version=2023-07-31

There’s some more information here on versions and models. Here I have chosen the Model: prebuilt-layout and the api-version of 2023-07-31 – which is GA at the time of writing. It’s worth noting that I struggled to use the newer preview models due to my Azure AI services location being UK South (we have to wait for the new toys!).

Method: POST
Headers: Ocp-Apim-Subscription-Key: {Azure Key Vault Secret Value}
Body:
{"base64Source": triggerBody()?['file']?'contentBytes']}

For the body, we are passing the '‘File content contentBytes’' value from the Manual Trigger. This converts the file into base64 format so that it can be uploaded as part of the request.
Now we have a way to send the document, we then need a way to wait until it has been processed.
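
If it helps to picture it, the first HTTP action should come back almost immediately with a 202 Accepted status and an Operation-Location response header, which looks roughly like this ({resultId} is a placeholder generated by the service):

Operation-Location: {Your Azure AI Service Endpoint}/formrecognizer/documentModels/prebuilt-layout/analyzeResults/{resultId}?api-version=2023-07-31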

Do Until

Create a Do until polling loop so that we can wait for the service to process the document and send back a response.

Loop until the Status variable is equal to “succeeded”, with a Count of 5 to limit the number of retries and a 1 hour timeout.

Within the Do Until we need:

Delay: 10 Second Delay

HTTP_1: This HTTP action uses the ‘Operation-Location’ Header from the first HTTP action to check if the service has completed.

URL: outputs('HTTP')?['headers/Operation-Location']
Method: GET
Headers: Ocp-Apim-Subscription-Key: {Azure Key Vault Secret Value}

Set the Status Variable:

Value: body('HTTP_1')?['status']

After the Do Until, the analysis result is available from the final polling response:

body('HTTP_1')?['analyzeResult']

Now we have a response from the Azure AI Document Intelligence model, we can start to deal with the JSON that we get back. As Damien Bird mentions, it is helpful to visualise this output using https://codebeautify.org/jsonviewer to give yourself an idea of the output structure and how to reference and extract the data in your Flow.
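
Trimmed down, the polling response has a shape along these lines (a sketch; the status passes through notStarted and running before it reaches succeeded, and the nested properties are abbreviated here):

{
  "status": "succeeded",
  "createdDateTime": "...",
  "lastUpdatedDateTime": "...",
  "analyzeResult": {
    "paragraphs": [ ... ],
    "tables": [ ... ]
  }
}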

Extract the Document Information

I have decided that it is important to extract some of the headers from the document as well as the table data in order to give context to the information.

Parse JSON – Paragraphs

Layout Model Paragraphs
As the Layout model gives us the text in body('HTTP_1')?['analyzeResult/paragraphs'], we can just select that.

Content: body('HTTP_1')?['analyzeResult/paragraphs']
Schema:

{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "content": {
        "type": "string"
      }
    },
    "required": ["content"]
  }
}

As we only need the content property, I have shortened the schema so that only ['content'] is returned in the output.

Set variable - Metadata Object

Name: Metadata
Value:

{
  "Report_Name": "@{body('Parse_JSON_-_Paragraphs')[1]['content']}",
  "Issue_No": "@{first(split(body('Parse_JSON_-_Paragraphs')[2]['content'], '|'))}",
  "Issue_Date": "@{last(split(body('Parse_JSON_-_Paragraphs')[2]['content'], '|'))}"
}

Here we use the index of the ['content'] property to extract the values that we want. This is where it is helpful to preview the output using https://codebeautify.org/jsonviewer so that you can work out the indexes beforehand. In my example, the Report Name is the 2nd value [1], and the next paragraph [2] holds the issue details, so I have used a split() expression on the "|" to capture the first() and last() values either side of it, giving us the Issue_No and Issue_Date.
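
For illustration, the shortened paragraphs output for my example document would look something like this (the values are invented purely to show the indexing; [1] is the report name, [2] is the issue line):

[
  { "content": "Matt's Butchers" },
  { "content": "Sausage Price List" },
  { "content": "Issue 4 | 01/05/2024" }
]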

This is just an example; you will want to adjust the structure of your output to meet your own needs. We have captured this as an object here so that we can align it alongside the table rows later.

Extracting the table rows

Layout Tables

Parse JSON

We can now use a Parse JSON to extract all the Table data by selecting the child property /tables.


Content: body('HTTP_1')?['analyzeResult/tables']
Schema:

{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "columnCount": {
        "type": "integer"
      },
      "cells": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "rowIndex": {
              "type": "integer"
            },
            "columnIndex": {
              "type": "integer"
            },
            "content": {
              "type": "string"
            }
          },
          "required": ["rowIndex", "columnIndex", "content"]
        }
      }
    },
    "required": ["columnCount", "cells"]
  }
}
This is a fuller schema this time, as we need several of the properties.

For each – Table

Input: body('Parse_JSON')
If you have multiple tables to go through, this is useful; if not, you can skip this action.

Set variable ParsedCellsArray

Set the ParsedCellsArray variable with 'cells' – the table's cells.
Name: ParsedCellsArray
Value: items('For_each-Table')['cells']
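
To picture what we are now holding, a cut-down sketch of one table's cells array looks like this (the contents are invented, but rowIndex, columnIndex, kind, and content are what the Layout model returns):

[
  { "rowIndex": 0, "columnIndex": 0, "kind": "columnHeader", "content": "Name" },
  { "rowIndex": 0, "columnIndex": 1, "kind": "columnHeader", "content": "Price" },
  { "rowIndex": 1, "columnIndex": 0, "content": "Cumberland Sausages" },
  { "rowIndex": 1, "columnIndex": 1, "content": "£3.50" }
]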

Select

So, if we were to just use....

From: "@variables('ParsedCellsArray')",

Map:

{
 "content" : @item()?['content'],
 "rowIndex" : @item()?['rowIndex'],
 "columnIndex" : @item()?['columnIndex']
}

Then we would get each cell stacked in its own separate object, which we would have to somehow zigzag across or remap, as we want every object in our array to contain a Name and a Price.
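
With the sample cells above, that naive Select would produce something like:

[
  { "content": "Name", "rowIndex": 0, "columnIndex": 0 },
  { "content": "Price", "rowIndex": 0, "columnIndex": 1 },
  { "content": "Cumberland Sausages", "rowIndex": 1, "columnIndex": 0 },
  { "content": "£3.50", "rowIndex": 1, "columnIndex": 1 }
]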


To do this we need to integrate a range into the Select action using:

range(0, length(variables('ParsedCellsArray')))

to generate an index to iterate over the array.

From

To get the value of the adjacent column in the Select action, we use range(startIndex: integer, count: integer), with length() supplying the count from ParsedCellsArray. This allows the Select action to iterate over each element in ParsedCellsArray by generating a sequence of index numbers.

Map

This number sequence is then used in the Map for the last property to get the next element, "nextContent".

nextContent:

@{if(less(item(), sub(length(variables('ParsedCellsArray')), 1)), variables('ParsedCellsArray')[add(item(), 1)]?['content'], null)}

This expression uses the if() function to check whether the current index item() is less than the last index of the array, sub(length(variables('ParsedCellsArray')), 1).

  • If true, it retrieves the content of the next element: variables('ParsedCellsArray')[add(item(), 1)]?['content'].
  • If false (i.e., the current element is the last one), it returns null.

So to put this all together we get:

From: range(0, length(variables('ParsedCellsArray')))
Map:

{
  "content": "@{variables('ParsedCellsArray')[item()]?['content']}",
  "rowIndex": "@{variables('ParsedCellsArray')[item()]?['rowIndex']}",
  "columnIndex": "@{variables('ParsedCellsArray')[item()]?['columnIndex']}",
  "isColumnHeader": "@{equals(variables('ParsedCellsArray')[item()]?['kind'], 'columnHeader')}",
  "Index": "@{item()}",
  "nextContent": "@{if(less(item(), sub(length(variables('ParsedCellsArray')), 1)), variables('ParsedCellsArray')[add(item(), 1)]?['content'], null)}"
}

The outcome of this starts to line up for us, but it still needs filtering. For example, we only need the first column, and we can ignore the column headers.
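
Using the same sample cells, the Select output now looks something like this; the row data lines up via nextContent, but the header row and the second column still need filtering out (note that the @{} interpolation turns every value into a string):

[
  { "content": "Name", "rowIndex": "0", "columnIndex": "0", "isColumnHeader": "true", "Index": "0", "nextContent": "Price" },
  { "content": "Price", "rowIndex": "0", "columnIndex": "1", "isColumnHeader": "true", "Index": "1", "nextContent": "Cumberland Sausages" },
  { "content": "Cumberland Sausages", "rowIndex": "1", "columnIndex": "0", "isColumnHeader": "false", "Index": "2", "nextContent": "£3.50" },
  { "content": "£3.50", "rowIndex": "1", "columnIndex": "1", "isColumnHeader": "false", "Index": "3", "nextContent": "" }
]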

Filter Array

We can use Filter Array to filter the returned data. This is especially useful if the tables contain some data that we don’t want, such as comments or notes. Rather than using a single, basic filter, we need to switch to advanced mode 🤓 to combine the multiple conditions we want.
You may want to use other conditions inside the and() expression to fine-tune your data.

From: body('Select')
Advanced Filter:

@and(
    equals(item()?['columnIndex'], 0),
    equals(item()?['isColumnHeader'], 'false'),
    not(equals(item()?['content'], '')),
    not(equals(item()?['nextContent'], ''))
)

In order, the conditions: keep only the first column; ignore the column header row (the Select map's @{} interpolation outputs strings, so we compare against 'false'); and skip any blanks in content or nextContent. Expressions can't contain comments, so any explanation has to live outside the filter itself.


Select – Filtered Table

We can merge in some of the metadata extracted from the Document, which we can now align in the context of each record in the array. The first three in my example are from the Metadata Object created earlier. You can also use this opportunity to further cleanse the values returned using string expressions if you need.

From: body('Filter_Array')
Map:

{
  "Report_Name": "@{variables('Metadata')?['Report_Name']}",
  "Issue_No": "@{variables('Metadata')?['Issue_No']}",
  "Issue_Date": "@{variables('Metadata')?['Issue_Date']}",
  "Product": "@{item()?['content']}",
  "Price": "@{item()?['nextContent']}"
}
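
With the sample values used throughout, the final array would look something like this, one object per table row:

[
  {
    "Report_Name": "Sausage Price List",
    "Issue_No": "Issue 4",
    "Issue_Date": "01/05/2024",
    "Product": "Cumberland Sausages",
    "Price": "£3.50"
  }
]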

Set variable - Filtered TransformedArray

Value: body('Select_-_Filtered_Table')
We capture the table data into an array so that we can use it elsewhere. From here you can do whatever you want with your array; for example, if you use an Add a new row Dataverse action, you can map each array property in the action.
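
As a rough sketch of that last step (the Dataverse column names here are hypothetical), you would add an Apply to each over the array and map each property inside the Add a new row action:

Product Name: @{items('Apply_to_each')?['Product']}
Price: @{items('Apply_to_each')?['Price']}
Report Name: @{items('Apply_to_each')?['Report_Name']}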

Conclusion

Thank you for reading this post. Hopefully it is useful to anyone who needs to use this particular Azure AI Document Intelligence model to process their documents and place the data into a more organised format. Please leave feedback in the comments if you found this useful, or if there are any improvements you could recommend. Please experiment with it too; I'd love to hear some of the outcomes.
