Wendy Wong for AWS Community Builders


AWS re:Invent 2022 - Classify German legal documents with Amazon Comprehend IDP - Part 1

AWS re:Invent 2022 | AI and Machine Learning

One of the new announcements from AWS re:Invent 2022 in the AI and Machine Learning category was Amazon Comprehend Intelligent Document Processing (IDP), which extends Amazon Comprehend to additional document file formats such as PDF, Microsoft Word and images.

Background

The existing Amazon Comprehend natural language processing service can mine text files to perform analyses such as the following (a short boto3 sketch follows the list):

  • Key phrase extraction
  • Topic modelling
  • PII detection
  • Sentiment analysis
  • Targeted sentiment analysis
  • Custom classification
  • Custom entity recognition
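
As a quick illustration of these analyses, here is a minimal boto3 sketch. It is a sketch only: the region and the sample sentence are my own assumptions, not part of the original tutorial.

import boto3

# Create a Comprehend client (region chosen for illustration)
comprehend = boto3.client("comprehend", region_name="us-east-1")

text = "AWS re:Invent 2022 was held in Las Vegas."  # hypothetical sample text

# Built-in, pre-trained text analyses
sentiment = comprehend.detect_sentiment(Text=text, LanguageCode="en")
key_phrases = comprehend.detect_key_phrases(Text=text, LanguageCode="en")
entities = comprehend.detect_entities(Text=text, LanguageCode="en")

print(sentiment["Sentiment"])
print([kp["Text"] for kp in key_phrases["KeyPhrases"]])
print([(e["Text"], e["Type"]) for e in entities["Entities"]])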

My solution architecture for ingesting a single text file is provided below:

[Image: solution architecture for ingesting a single text file]

What are the business use cases for Amazon Comprehend?

  • Analyze customer support tickets
  • Mine call centre analytics
  • Extract customer sentiment and key phrases from customer surveys
  • Analyze customer interactions
  • Find key topics in customer feedback
  • Classify and extract entities from documents

Learning Objectives

In this lesson, you will learn:

  • How to classify and extract entities with the new feature from Amazon Comprehend IDP

Amazon Comprehend Intelligent Document Processing (IDP)

Amazon Comprehend Intelligent Document Processing (IDP) has greater flexibility to directly process text and extract custom entities from PDF, Microsoft Word documents and images.

As at 21 December 2022, custom entity recognizers for PDF documents can only be used with the English language.

This is the reference architecture provided by Amazon Web Services:

[Image: AWS reference architecture for Amazon Comprehend IDP]

As announced at AWS re:Invent 2022, PDF files can now be processed directly to extract entities and classify documents. Below is a brief sketch of what this looks like in code.
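
The sketch assumes a hypothetical local PDF and a placeholder custom recognizer ARN; synchronous custom entity detection on a PDF requires a trained custom entity recognizer.

import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")

# Hypothetical local PDF file
with open("court_decision.pdf", "rb") as f:
    document_bytes = f.read()

# Synchronous entity detection on raw PDF bytes (re:Invent 2022 feature)
response = comprehend.detect_entities(
    Bytes=document_bytes,
    EntityRecognizerArn="arn:aws:comprehend:us-east-1:123456789012:entity-recognizer/example",  # placeholder
    DocumentReaderConfig={
        "DocumentReadAction": "TEXTRACT_DETECT_DOCUMENT_TEXT",
        "DocumentReadMode": "SERVICE_DEFAULT",
    },
)

for entity in response["Entities"]:
    print(entity["Type"], entity["Text"], round(entity["Score"], 3))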


What are the business use cases for Amazon Comprehend IDP?

The use cases include:

  • Extract information from insurance claim forms
  • Classify and extract entities from income statements to complete loan applications
  • Real-time processing of documents (i.e. synchronous)
  • Batch processing of large documents (i.e. asynchronous) - a short sketch of both modes follows this list
  • Classify and extract insights from legal documents
  • Extract information from tax invoices
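
The sketch below makes the synchronous/asynchronous distinction concrete. The bucket names, role ARN and job name are placeholders I invented for illustration.

import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")

# Synchronous: one short document, results returned in the response
sync_result = comprehend.detect_entities(
    Text="Das Urteil wurde 2018 veröffentlicht.",
    LanguageCode="de",
)
print(sync_result["Entities"])

# Asynchronous: a batch job over many documents stored in S3
async_job = comprehend.start_entities_detection_job(
    JobName="entity-detection-demo",  # placeholder
    InputDataConfig={
        "S3Uri": "s3://example-input-bucket/documents/",
        "InputFormat": "ONE_DOC_PER_FILE",
    },
    OutputDataConfig={"S3Uri": "s3://example-output-bucket/results/"},
    DataAccessRoleArn="arn:aws:iam::123456789012:role/ComprehendDataAccess",  # placeholder
    LanguageCode="de",
)
print(async_job["JobId"])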

Dataset

This dataset contains German legal documents in PDF format and was downloaded from the website Papers with Code.

The legal documents are court decisions from 2017 and 2018 that were published online by the Federal Ministry of Justice and Consumer Protection.

Citation:

@misc{https://doi.org/10.48550/arxiv.2003.13016,
  doi       = {10.48550/ARXIV.2003.13016},
  url       = {https://arxiv.org/abs/2003.13016},
  author    = {Leitner, Elena and Rehm, Georg and Moreno-Schneider, Julián},
  keywords  = {Computation and Language (cs.CL), Information Retrieval (cs.IR), FOS: Computer and information sciences},
  title     = {A Dataset of German Legal Documents for Named Entity Recognition},
  publisher = {arXiv},
  year      = {2020},
  copyright = {arXiv.org perpetual, non-exclusive license}
}

Two Approaches - Entity Lists or Annotation of PDFs

There are two methods to extract entities from PDF files:

  • Method 1: A simpler approach that prepares the training data as plain text in a UTF-8 CSV file with the two headings 'Text' and 'Type', with a minimum of 25 entities (i.e. Type) listed in the entity list, and

  • Method 2: An advanced approach that involves a few pre-requisites to set up the environment and label documents with Amazon SageMaker Ground Truth

Tutorial 1 - Method 1: Custom Entity Recognition with Amazon Comprehend Intelligent Document Processing (IDP) for German legal documents translated into English, using entity lists.

  • Step 1: Sign in to the AWS Management Console with the IAM Administrator role.

Navigate to Amazon Comprehend and click Launch Amazon Comprehend.


  • Step 2: Select Analysis jobs from the left-hand side menu and click Create job.


  • Step 3: Create the analysis job

Provide a name for the analysis job and select 'Entities' as the analysis type from the drop-down menu.

Select 'English' as the language.


For the input document, select the training dataset from the S3 bucket and set the input format to 'one document per line', which suits this small file.


Create an IAM role for Amazon Comprehend and click Create job.

  • Step 4: The analysis job will take a few minutes to complete processing.


Let's inspect the output of the analysis job:

Download the output file, which is in .gz format, and open it with 7-Zip. If you do not have 7-Zip, you may access it here.
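
If you prefer to inspect the output programmatically, here is a small Python sketch. It assumes the job's output archive has been downloaded locally as 'output.tar.gz' (a placeholder name) and that each inner file contains one JSON object per line, one per analyzed document.

import json
import tarfile

# Comprehend analysis jobs write a gzip-compressed tar archive
with tarfile.open("output.tar.gz", "r:gz") as tar:  # placeholder file name
    for member in tar.getmembers():
        f = tar.extractfile(member)
        if f is None:
            continue
        for line in f:
            record = json.loads(line)
            for entity in record.get("Entities", []):
                print(entity["Type"], entity["Text"], round(entity["Score"], 3))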

  • Step 5: Prepare the training data entity list for building a custom entity recognition model

The CSV file for preparing the training data is in UTF-8 format with the headers 'Text' and 'Type'. The 'Type' values are in uppercase, as per the AWS documentation on preparing training data.

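
As a sketch, such an entity list can be generated with Python's csv module. The entity values below are invented placeholders, not the actual training data.

import csv

# Hypothetical entity list rows: (Text, Type), with Type in uppercase
entries = [
    ("Bundesgerichtshof", "COURT"),
    ("Bundesministerium der Justiz", "ORGANIZATION"),
    ("Strafgesetzbuch", "LAW"),
]

# UTF-8 CSV with the two headers Comprehend expects for an entity list
with open("entity_list.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Text", "Type"])
    writer.writerows(entries)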

  • Step 6: Create a Custom Entity Recognition model.

In Amazon Comprehend, click Create new model.


Add custom entity types.


Upload the entity list as a CSV file to the Amazon S3 bucket as training data.


Upload the 'german_test' CSV file to the Amazon S3 bucket as test data.


Select the second option, the custom entity list, as the source of training data; select an output location for the trained model; and select the custom test data stored in the Amazon S3 bucket.


Select the existing IAM role created for Amazon Comprehend and click Create.


The model is then submitted for training.
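
For reference, the console steps above roughly correspond to the CreateEntityRecognizer API. In this boto3 sketch, the recognizer name, entity types, bucket paths and role ARN are all placeholders.

import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")

response = comprehend.create_entity_recognizer(
    RecognizerName="german-legal-entities",  # placeholder
    LanguageCode="en",
    DataAccessRoleArn="arn:aws:iam::123456789012:role/ComprehendDataAccess",  # placeholder
    InputDataConfig={
        "EntityTypes": [{"Type": "COURT"}, {"Type": "ORGANIZATION"}, {"Type": "LAW"}],
        "Documents": {"S3Uri": "s3://example-bucket/train/german_train.csv"},
        "EntityList": {"S3Uri": "s3://example-bucket/train/entity_list.csv"},
    },
)
print(response["EntityRecognizerArn"])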


Tip 1: In the training data, remove any special characters, e.g. '/'.

Tip 2: Each custom entity must occur a minimum of 25 times to be included in the entity list plain-text file.

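
You can check this rule before training with a few lines of Python. The file names are placeholders, and the check simply counts substring occurrences of each entity in the training text.

import csv
from collections import Counter

# Load the entity list (placeholder file name)
with open("entity_list.csv", encoding="utf-8") as f:
    entities = [row["Text"] for row in csv.DictReader(f)]

# Load the training documents, one per line (placeholder file name)
with open("german_train.csv", encoding="utf-8") as f:
    lines = f.readlines()

counts = Counter({e: sum(line.count(e) for line in lines) for e in entities})

for entity, n in counts.items():
    flag = "" if n >= 25 else "  <-- fewer than 25 occurrences"
    print(f"{entity}: {n}{flag}")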

I updated the fourth model for training purposes.


After 15 minutes, the custom entity recognizer model is trained.


The model produced an F1 score of 77.30 on the training dataset.

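
For readers unfamiliar with the metric: F1 is the harmonic mean of precision and recall. A tiny illustration with made-up precision and recall values that land near this model's score:

# F1 = 2 * (precision * recall) / (precision + recall)
precision = 0.80  # made-up value for illustration only
recall = 0.75     # made-up value for illustration only

f1 = 2 * precision * recall / (precision + recall)
print(round(f1 * 100, 2))  # 77.42, close to the 77.30 reported above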

Step 7: Create a custom entity detection analysis job (asynchronous batch processing)

Finally, create an analysis job and select 'custom entity recognition' as the analysis type. Select the test data from the S3 bucket.

Select the output data location in the S3 bucket, select the IAM role and click Create job.

The custom entity analysis job will take a few minutes to complete.
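
The equivalent API call is StartEntitiesDetectionJob with the ARN of the trained custom recognizer. Again a sketch with placeholder names:

import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")

job = comprehend.start_entities_detection_job(
    JobName="german-legal-custom-entities",  # placeholder
    EntityRecognizerArn="arn:aws:comprehend:us-east-1:123456789012:entity-recognizer/german-legal-entities",  # placeholder
    InputDataConfig={
        "S3Uri": "s3://example-bucket/test/german_test.csv",
        "InputFormat": "ONE_DOC_PER_LINE",
    },
    OutputDataConfig={"S3Uri": "s3://example-bucket/output/"},
    DataAccessRoleArn="arn:aws:iam::123456789012:role/ComprehendDataAccess",  # placeholder
    LanguageCode="en",
)
print(job["JobStatus"])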


Let's inspect the output of the custom entity recognizer analysis for legal documents stored in the S3 bucket.


Results of Custom Entity Recognition model from Entity Lists

Running the test data through the trained custom entity recognition model produced the following results.

Almost all of the entity types extracted from the test data file were recognized in the input text with a 99% confidence level.


The exceptions were the last seven input texts, where the model recognized the entities for the German organizations from the legal documents with lower confidence.


Method 2: Pre-requisites - Checklist before Annotating PDF Files

Pre-requisite 1: Create a virtual environment to use the latest version of Python in AWS Cloud9

  • Step 1: Create an AWS account with an IAM Administrator - refer to my blog.

Log in to the AWS Management Console as the IAM Administrator.

  • Step 2: Create a virtual environment to use the latest version of Python in AWS Cloud9.

Navigate to the search bar and type 'Cloud9'.


Select Create environment.


Create a name for the temporary environment, select the smallest EC2 instance type (i.e. t2.micro) and then click Create.


  • Step 3: Configure the virtual environment with your AWS credentials.

Open the Cloud9 IDE and configure your AWS virtual environment from the terminal.


Type 'aws --version' to check the version of the AWS CLI.


Refer to this link to configure the AWS environment.

Pre-requisite 2: Setting up the environment

  • Step 1: Install Cygwin for Windows by clicking this link


  • Step 2: Download the annotation files from GitHub

  • Step 3: Create a virtual environment in Python

In the terminal of Visual Studio Code, install virtualenv and upgrade pip:

$ pip install virtualenv

$ pip install --upgrade pip

Then create the virtual environment:

$ virtualenv venv


  • Step 4: Unzip the GitHub files

Unzip the annotation files folder downloaded from GitHub in your IDE.

I used the Visual Studio Code IDE: go to Extensions on the left-hand side menu and install the 'AWS Toolkit' extension.

Go to View -> Command Palette -> 'AWS: Create Credentials Profile'.

Provide your profile details from the IAM Administrator:

AWS Access Key ID:
AWS Secret Access Key:


Pre-requisite 3: Create an Amazon S3 bucket (using the AWS Management Console) - refer to my blog.

  • Step 1: Create an S3 bucket called 'src'.


Retain the default settings and select Create bucket.

  • Step 2: Upload the training dataset in PDF format to the S3 bucket you have created.

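
If you prefer the AWS SDK over the console, here is a small boto3 sketch for the upload; the local folder name and key prefix are placeholders.

import boto3
from pathlib import Path

s3 = boto3.client("s3")

# Upload every PDF in a local folder to the 'src' bucket
for pdf in Path("train_pdfs").glob("*.pdf"):  # placeholder folder
    s3.upload_file(str(pdf), "src", f"train/{pdf.name}")
    print(f"Uploaded {pdf.name}")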

Pre-requisite 4: Annotate PDF files

In the next lesson, we will annotate PDF files from the German legal documents and complete the pre-requisites from here.

Resources

AWS re:Invent 2022 keynotes, workshops and leadership sessions

In case you missed AWS re:Invent 2022 a few weeks ago, you can experience the excitement by learning about the latest innovations and product features here.


Until the next lesson, happy learning! 😁

Next Lesson: AWS re:Invent 2022 - Document AI: Classify German legal documents with Amazon Comprehend IDP - Part 2

Custom Entity Recognition Analysis with Amazon Comprehend Intelligent Document Processing (IDP) for German legal documents translated into English - for annotated PDF files labeled using Amazon SageMaker Ground Truth.

Oldest comments (2)

Divyanshu Katiyar

A very informative post! Annotation for legal requirements requires domain experts and an easy-to-use interface to annotate such unstructured texts. The docs can be in text or in a PDF file, so we have to make sure that we have the means to annotate the data in different formats. For my project, I use NLP Lab, which is a free-to-use no-code platform for automated annotation, with features like pre-annotation, building relations among entities, annotating data in PDF files, etc.

Wendy Wong

Thanks Divyanshu for reading the article and sharing with me your experience with the no-code platform that you used for your project. That's very insightful and I will check out this automated annotation tool.