AWS re:Invent 2022 | AI and Machine Learning
One of the new announcements from AWS re:Invent 2022 in the AI and Machine Learning category was Amazon Comprehend Intelligent Document Processing (IDP), which adds support for additional document file formats such as PDFs, Microsoft Word documents and images.
Background
The existing Amazon Comprehend natural language processing service can mine text files to perform analyses such as:
- Key phrase extraction
- Topic Modelling
- PII detection
- Sentiment Analysis
- Targeted Sentiment Analysis
- Custom Classification
- Custom Entity Recognition
My solution architecture for ingesting a single text file is provided below:
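In code, the core of that single-file ingestion can be sketched with the AWS SDK for Python (boto3). The `chunk_text` helper and its 4,500-byte limit are my own illustrative choices to stay under Comprehend's per-request size limits; the region and file path are placeholders.

```python
def chunk_text(text, max_bytes=4500):
    """Split text into pieces that stay under Comprehend's per-request size limit."""
    chunks, current = [], ""
    for line in text.splitlines(keepends=True):
        if current and len((current + line).encode("utf-8")) > max_bytes:
            chunks.append(current)
            current = line
        else:
            current += line
    if current:
        chunks.append(current)
    return chunks

def analyse_text_file(path, language_code="en", region="us-east-1"):
    """Run sentiment, key phrase and entity analyses over a single text file."""
    import boto3  # imported lazily so chunk_text stays usable without the AWS SDK
    comprehend = boto3.client("comprehend", region_name=region)
    with open(path, encoding="utf-8") as f:
        text = f.read()
    results = []
    for chunk in chunk_text(text):
        results.append({
            "sentiment": comprehend.detect_sentiment(Text=chunk, LanguageCode=language_code),
            "key_phrases": comprehend.detect_key_phrases(Text=chunk, LanguageCode=language_code),
            "entities": comprehend.detect_entities(Text=chunk, LanguageCode=language_code),
        })
    return results
```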
What are the business use cases for Amazon Comprehend?
- Customer Support Tickets
- Mine call-centre transcripts for analytics
- Extract customer sentiment, key phrases from customer surveys
- Analyze customer interactions
- Find key topics from customer feedback
- Classify and extract entities from documents
Learning Objectives
In this lesson, you will learn:
- How to classify and extract entities with the new feature from Amazon Comprehend IDP
Amazon Comprehend Intelligent Document Processing (IDP)
Amazon Comprehend Intelligent Document Processing (IDP) has greater flexibility to directly process text and extract custom entities from PDF, Microsoft Word documents and images.
As of 21 December 2022, custom entity recognizers for PDF documents support only the English language.
This is the reference architecture provided by Amazon Web Services:
With the features announced at AWS re:Invent 2022, PDF files can now be used to classify documents and extract insights from entities.
What are the business use cases for Amazon Comprehend IDP?
The use cases include:
- Extract information from insurance claim forms
- Classify and extract entities from income statements to complete loan applications
- Real-time processing of documents (i.e. synchronous)
- Batch-processing of large documents (i.e. asynchronous)
- Classify and extract insights from legal documents
- Extract information from tax invoices
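The real-time and batch use cases above map to different Comprehend APIs: a synchronous call for one small document, and an asynchronous job over a collection in S3. A rough boto3 sketch follows; the job name, S3 URIs and role ARN are placeholders you would substitute with your own.

```python
def build_batch_job_config(input_s3_uri, output_s3_uri, input_format="ONE_DOC_PER_FILE"):
    """Assemble the input/output parameters shared by asynchronous Comprehend jobs."""
    return {
        "InputDataConfig": {"S3Uri": input_s3_uri, "InputFormat": input_format},
        "OutputDataConfig": {"S3Uri": output_s3_uri},
    }

def detect_entities_sync(text, language_code="en"):
    """Real-time (synchronous) entity detection for one small document."""
    import boto3  # imported lazily so the config helper runs without the AWS SDK
    comprehend = boto3.client("comprehend")
    return comprehend.detect_entities(Text=text, LanguageCode=language_code)["Entities"]

def start_entities_job_async(input_s3_uri, output_s3_uri, role_arn, language_code="en"):
    """Batch (asynchronous) entity detection over a document collection in S3."""
    import boto3
    comprehend = boto3.client("comprehend")
    response = comprehend.start_entities_detection_job(
        JobName="batch-entities",  # placeholder job name
        LanguageCode=language_code,
        DataAccessRoleArn=role_arn,
        **build_batch_job_config(input_s3_uri, output_s3_uri),
    )
    return response["JobId"]
```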
Dataset
This dataset contains German legal documents in PDF format, downloaded from the Papers with Code website.
The legal documents consist of court decisions from 2017 and 2018, published online by the Federal Ministry of Justice and Consumer Protection.
Citation:
@misc{https://doi.org/10.48550/arxiv.2003.13016,
doi = {10.48550/ARXIV.2003.13016},
url = {https://arxiv.org/abs/2003.13016},
author = {Leitner, Elena and Rehm, Georg and Moreno-Schneider, Julián},
keywords = {Computation and Language (cs.CL), Information Retrieval (cs.IR), FOS: Computer and information sciences, FOS: Computer and information sciences},
title = {A Dataset of German Legal Documents for Named Entity Recognition},
publisher = {arXiv},
year = {2020},
copyright = {arXiv.org perpetual, non-exclusive license}
}
Two Approaches - Annotation of PDFs or Entity Lists
There are two methods to extract entities from PDF files:
Method 1: A simpler approach that prepares the training data as an entity list: a plain-text CSV file in UTF-8 format with the two headings 'Text' and 'Type', listing a minimum of 25 occurrences per entity type; and
Method 2: An advanced approach that involves a few pre-requisites to set up the environment and label documents with Amazon SageMaker Ground Truth.
Tutorial 1 (Method 1): Custom Entity Recognition with Amazon Comprehend Intelligent Document Processing (IDP) for German legal documents translated into English, using entity lists.
- Step 1: Sign into the AWS Management Console with IAM Administrator role.
Navigate to Amazon Comprehend and click Launch Amazon Comprehend.
- Step 2: Select Analysis jobs from the left-hand side and click Create job.
- Step 3: Create analysis
Provide a name for the analysis and select 'Entities' as the analysis type from the drop-down menu.
Select 'English' as the language.
For the input document, select the training dataset from the S3 bucket and set the input format to 'One document per line', since the file is small.
Create an IAM role for Amazon Comprehend and click Create job.
- Step 4: The analysis job will take a few minutes to complete processing.
Let's inspect the output of the analysis job:
Download and open the output file, which is in gz format, using 7-Zip or a similar archive tool.
- Step 5: Prepare training data entity list for building a custom recognition model
The CSV file for the training data is in UTF-8 format with the headers 'Text' and 'Type'. The 'Type' values are in uppercase, as per the AWS documentation on preparing data.
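A minimal sketch of producing that entity list with Python's csv module; the example entries and the COURT/ORGANIZATION type names are illustrative placeholders, not the dataset's actual labels.

```python
import csv

def write_entity_list(rows, path="entity_list.csv"):
    """Write an entity list CSV (UTF-8, 'Text' and 'Type' headers) for Comprehend."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Text", "Type"])
        for text, entity_type in rows:
            writer.writerow([text, entity_type.upper()])  # types must be uppercase

# Illustrative entries only -- use the entities from your own training data.
write_entity_list([
    ("Bundesgerichtshof", "COURT"),
    ("Federal Ministry of Justice", "ORGANIZATION"),
])
```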
- Step 6: Create a Custom Entity Recognition model.
In Amazon Comprehend click Create new model
Add custom entity types.
Upload 'entity list' as a csv file into Amazon S3 bucket as training data.
Upload 'german_test' csv file into the Amazon S3 bucket for the test data.
Navigate and select the second option, 'custom entity list', as the source of training data; select an output location for the trained data; and select the custom test data stored in the Amazon S3 bucket.
Select the existing IAM role created for Amazon Comprehend and click Create.
Model is submitted for processing.
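The same model can be created with boto3 instead of the console. This is a sketch only: the recognizer name, bucket, S3 keys, role ARN and entity types below are all placeholders.

```python
def build_recognizer_request(name, role_arn, bucket, entity_types):
    """Assemble a CreateEntityRecognizer request for entity-list training."""
    return {
        "RecognizerName": name,
        "LanguageCode": "en",
        "DataAccessRoleArn": role_arn,
        "InputDataConfig": {
            "EntityTypes": [{"Type": t} for t in entity_types],
            "Documents": {"S3Uri": f"s3://{bucket}/train/documents.txt"},     # placeholder key
            "EntityList": {"S3Uri": f"s3://{bucket}/train/entity_list.csv"},  # placeholder key
        },
    }

def create_recognizer(request):
    """Submit the recognizer for training and return its ARN."""
    import boto3  # imported lazily so the request builder runs without the AWS SDK
    comprehend = boto3.client("comprehend")
    return comprehend.create_entity_recognizer(**request)["EntityRecognizerArn"]
```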
Tip 1: In the training data, remove any special characters, e.g. '/'.
Tip 2: Each custom entity must have a minimum occurrence of 25 times to be included in the plain text file of an entity list.
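Both tips can be checked programmatically before uploading the training data. The set of characters stripped below is an illustrative choice, not an AWS-mandated list.

```python
import re
from collections import Counter

def clean_text(text):
    """Replace special characters (e.g. '/') that can interfere with training."""
    return re.sub(r"[/\\|<>{}]", " ", text)

def undertrained_types(rows, minimum=25):
    """Return entity types that occur fewer than `minimum` times in the entity list."""
    counts = Counter(entity_type for _, entity_type in rows)
    return [t for t, n in counts.items() if n < minimum]
```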
I updated the fourth model for training purposes.
After 15 minutes, the classifier model is trained.
The model produced an F1 score of 77.30 on the training data set.
- Step 7: Create a custom entity detection analysis job (asynchronous batch processing)
Finally, create an analysis job selecting 'custom entity recognition' as the analysis type. Select the test data from S3 bucket.
Select the output data location in the S3 bucket, select the IAM role and click Create job.
The custom entity analysis will be processed and take a few minutes to complete.
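These console steps correspond to the StartEntitiesDetectionJob API. A boto3 sketch with placeholder names, URIs and ARNs, plus a small helper for polling the job status:

```python
import time

TERMINAL_STATES = {"COMPLETED", "FAILED", "STOPPED"}

def is_terminal(status):
    """True once an analysis job has finished (successfully or not)."""
    return status in TERMINAL_STATES

def run_custom_entities_job(recognizer_arn, input_uri, output_uri, role_arn):
    """Start a custom entity detection job and wait for it to finish."""
    import boto3  # imported lazily so is_terminal runs without the AWS SDK
    comprehend = boto3.client("comprehend")
    job_id = comprehend.start_entities_detection_job(
        JobName="german-legal-custom-entities",  # placeholder name
        EntityRecognizerArn=recognizer_arn,
        LanguageCode="en",
        DataAccessRoleArn=role_arn,
        InputDataConfig={"S3Uri": input_uri, "InputFormat": "ONE_DOC_PER_LINE"},
        OutputDataConfig={"S3Uri": output_uri},
    )["JobId"]
    while True:
        status = comprehend.describe_entities_detection_job(JobId=job_id)[
            "EntitiesDetectionJobProperties"]["JobStatus"]
        if is_terminal(status):
            return job_id, status
        time.sleep(30)  # jobs typically take a few minutes
```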
Let's inspect the output of the custom entity recognizer analysis for legal documents stored in the S3 bucket.
Results of Custom Entity Recognition model from Entity Lists
The output of Custom Entity Recognition model using test data on the trained classifier model produced the following results.
All the entity types extracted from the test data file were recognized in the input text with a 99% confidence level.
The exceptions were the last seven input texts, which had a lower confidence level when recognizing entities for the German organizations in the legal documents.
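One way to surface those lower-confidence predictions is to filter the JSON-lines output by score. The record shape below follows the Entities/Text/Type/Score fields in Comprehend's output; the threshold and sample values are illustrative.

```python
import json

def low_confidence_entities(output_lines, threshold=0.99):
    """Return (text, type, score) for predicted entities below the threshold."""
    flagged = []
    for line in output_lines:
        record = json.loads(line)
        for entity in record.get("Entities", []):
            if entity["Score"] < threshold:
                flagged.append((entity["Text"], entity["Type"], entity["Score"]))
    return flagged
```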
Method 2: Pre-requisites - Checklist before Annotating PDF Files
Pre-requisite 1: Create a virtual environment to use the latest version of Python in AWS Cloud9
- Step 1: Create an AWS account for IAM Administrator - refer to my blog
Login to the AWS Management Console account as IAM Administrator.
- Step 2: Create a virtual environment to use the latest version of Python in AWS Cloud9.
Navigate to Cloud9 by typing 'Cloud9' in the search bar.
Select Create environment.
Create a name for the temporary environment and select the smallest EC2 instance i.e. t2.micro and then click Create.
- Step 3: Configure the virtual environment with your AWS credentials.
Open the Cloud9 IDE and configure your AWS credentials in the terminal.
Type 'aws --version' to check the version of the AWS CLI.
Then run 'aws configure' to set up your credentials, as described in the AWS CLI documentation.
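Credentials can also be verified programmatically: STS reports the account and identity your environment is configured with. The `whoami` helper requires working AWS credentials; `account_from_arn` is a pure helper for reading the result.

```python
def account_from_arn(arn):
    """Extract the 12-digit account ID from an IAM ARN."""
    return arn.split(":")[4]

def whoami():
    """Verify the configured AWS credentials by asking STS who we are."""
    import boto3  # imported lazily so account_from_arn runs without the AWS SDK
    identity = boto3.client("sts").get_caller_identity()
    return identity["Account"], identity["Arn"]
```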
Pre-requisite 2: Setting up the environment
- Step 1: Install Cygwin for Windows (available from the Cygwin website)
- Step 2: Download the annotation files from GitHub
- Step 3: Create a virtual environment in Python
In the Visual Studio Code terminal, run:
$ pip install --upgrade pip
$ pip install virtualenv
$ virtualenv .venv
$ source .venv/bin/activate  (on Windows: .venv\Scripts\activate)
- Step 4: Unzip the GitHub files
Unzip the annotation files folder downloaded from GitHub in your IDE.
I used the Visual Studio Code IDE: go to Extensions in the left-hand menu and install the 'AWS Toolkit'.
Go to View -> Command Palette -> 'AWS: Create Credentials Profile':
Provide your profile details from IAM Administrator:
AWS Access Key ID:
AWS Secret Access Key:
Pre-requisite 3: Create an Amazon S3 bucket (using the AWS Management Console) - refer to my blog.
- Step 1: Create an S3 bucket called 'src'.
Retain the default settings and select Create bucket.
- Step 2: Upload the train dataset in PDF format into the S3 bucket you have created.
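The upload can also be scripted with boto3; the bucket name, folder and 'train' prefix below are placeholders.

```python
from pathlib import Path

def object_key(prefix, filename):
    """Build the S3 object key for an uploaded file."""
    return f"{prefix.rstrip('/')}/{filename}" if prefix else filename

def upload_pdfs(bucket, folder, prefix="train"):
    """Upload every PDF in `folder` to the S3 bucket under `prefix`."""
    import boto3  # imported lazily so object_key runs without the AWS SDK
    s3 = boto3.client("s3")
    for pdf in Path(folder).glob("*.pdf"):
        s3.upload_file(str(pdf), bucket, object_key(prefix, pdf.name))
```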
Pre-requisite 4: Annotate PDF files
In the next lesson, we will annotate PDF files from the German legal document and complete the pre-requisites from here.
Resources
AWS re:Invent 2022 keynotes, workshops and leadership sessions
In case you missed AWS re:Invent 2022 a few weeks ago you can experience the excitement by learning about the latest innovation and product features here.
Until the next lesson, happy learning! 😁
Next Lesson: AWS re:Invent 2022 - Document AI: Classify German legal documents with Amazon Comprehend IDP - Part 2
Custom Entity Recognition Analysis with Amazon Comprehend Intelligent Document Processing (IDP) for German legal documents translated into English - for annotated PDF files labeled using Amazon SageMaker Ground Truth.
Top comments (2)
A very informative post! Annotation for legal documents requires domain experts and an easy-to-use interface for annotating such unstructured texts. The docs can be in text or in a PDF file, so we have to make sure we have the means to annotate the data in different formats. For my project, I use NLP Lab, which is a free-to-use no-code platform for automated annotation, with features like pre-annotation, building relations among entities, annotating data in PDF files, etc.
Thanks Divyanshu for reading the article and sharing with me your experience with the no-code platform that you used for your project. That's very insightful and I will check out this automated annotation tool.