Wendy Wong for AWS Community Builders

AWS re:Invent 2022 - Classify German legal documents with Amazon Comprehend IDP - Part 1

AWS re:Invent 2022 | AI and Machine Learning

One of the new announcements from AWS re:Invent 2022 in the AI and Machine Learning category was Amazon Comprehend Intelligent Document Processing (IDP), which adds support for other document file formats such as PDF, Microsoft Word documents and images.


The existing Amazon Comprehend natural language processing service can mine text files to perform analyses such as:

  • Key Phrase Extraction
  • Topic Modelling
  • PII detection
  • Sentiment Analysis
  • Targeted Sentiment Analysis
  • Custom Classification
  • Custom Entity Recognition

My solution architecture for ingesting a single text file is provided below:


What are the business use cases for Amazon Comprehend?

  • Analyze customer support tickets
  • Mine call centre analytics
  • Extract customer sentiment and key phrases from customer surveys
  • Analyze customer interactions
  • Find key topics from customer feedback
  • Classify and extract entities from documents

Learning Objectives

In this lesson, you will learn:

  • How to classify and extract entities with the new feature from Amazon Comprehend IDP

Amazon Comprehend Intelligent Document Processing (IDP)

Amazon Comprehend Intelligent Document Processing (IDP) offers greater flexibility: it can directly process text and extract custom entities from PDFs, Microsoft Word documents and images.

Custom entity recognizers for PDF documents can only be used for the English language as of 21 December 2022.

This is the reference architecture provided by Amazon Web Services:


As of AWS re:Invent 2022, PDF files can be used directly to classify documents and extract insights from entities.


What are the business use cases for Amazon Comprehend IDP?

The use cases include:

  • Extract information from insurance claim forms
  • Classify and extract entities from income statements to complete loan applications
  • Real-time processing of documents (i.e. synchronous)
  • Batch-processing of large documents (i.e. asynchronous)
  • Classify and extract insights from legal documents
  • Extract information from tax invoices


This tutorial uses a dataset of German legal documents in PDF format, downloaded from the website Papers with Code.

The dataset consists of court decisions from 2017 and 2018, published online by the Federal Ministry of Justice and Consumer Protection.


Citation: Leitner, E., Rehm, G. and Moreno-Schneider, J. (2020). A Dataset of German Legal Documents for Named Entity Recognition. arXiv. doi:10.48550/ARXIV.2003.13016. Keywords: Computation and Language (cs.CL), Information Retrieval (cs.IR). Copyright: perpetual, non-exclusive license.

Two Approaches - Annotation of PDFs or Entity Lists

There are two methods to extract entities from PDF files:

  • Method 1: A simpler approach is to prepare the training data as an entity list: a plain text csv file in UTF-8 format with two headings, 'Text' and 'Type', and a minimum of 25 entries for each entity type.

  • Method 2: A more advanced approach involves a few pre-requisites to set up the environment and label (annotate) the PDF documents with Amazon SageMaker Ground Truth.

Tutorial 1: Method 1: Custom Entity Recognition with Amazon Comprehend Intelligent Document Processing (IDP) for German legal documents translated into English - with Entity Lists.

  • Step 1: Sign in to the AWS Management Console as an IAM Administrator.

Navigate to Amazon Comprehend and click Launch Amazon Comprehend.


  • Step 2: Select Analysis jobs from the left-hand menu and click Create job.


  • Step 3: Create analysis

Provide a name for the analysis and select 'Entities' from the drop-down menu as the analysis type.

Select 'English' as the language.


For the input document, select the training dataset from the S3 bucket and set the input format to 'One document per line' because it is a small file.

Create an IAM role for Amazon Comprehend and click Create job.
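The console steps above could also be scripted. The following is a minimal sketch, assuming the boto3 library and valid AWS credentials are available; the job name, bucket paths and role ARN are hypothetical placeholders and not values from this tutorial. The request is built separately from the call so that it can be inspected before anything is sent to AWS.

```python
def build_entities_job_request(job_name, input_s3_uri, output_s3_uri, role_arn):
    """Assemble the keyword arguments for an entities detection job."""
    return {
        "JobName": job_name,
        "LanguageCode": "en",  # 'English', as selected in the console
        "DataAccessRoleArn": role_arn,
        "InputDataConfig": {
            "S3Uri": input_s3_uri,
            "InputFormat": "ONE_DOC_PER_LINE",  # 'One document per line'
        },
        "OutputDataConfig": {"S3Uri": output_s3_uri},
    }


def submit_entities_job(request):
    """Submit the job to Amazon Comprehend and return its job ID."""
    import boto3  # imported here so the request can be built without boto3

    comprehend = boto3.client("comprehend")
    response = comprehend.start_entities_detection_job(**request)
    return response["JobId"]


# Hypothetical placeholder values for illustration only.
request = build_entities_job_request(
    job_name="german-legal-entities",
    input_s3_uri="s3://example-bucket/input/train.txt",
    output_s3_uri="s3://example-bucket/output/",
    role_arn="arn:aws:iam::123456789012:role/ExampleComprehendRole",
)
```

Calling submit_entities_job(request) would start the job, which then runs for a few minutes as described in the next step.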

  • Step 4: The analysis job will take a few minutes to complete processing.


Let's inspect the output of the analysis job:

Download the output file, which is in gz format, and open it using 7-Zip. If you do not have 7-Zip, you may access it here.
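If you prefer not to install 7-Zip, the archive can also be read with the Python standard library. This is a sketch under an assumption: that the downloaded file is a gzip-compressed tar archive holding a text file with one JSON record per line. The sample archive built at the end is made up for illustration and is not real job output.

```python
import io
import json
import tarfile


def read_comprehend_output(archive_bytes):
    """Return one parsed JSON record per non-empty line in the archive."""
    records = []
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue
            text = archive.extractfile(member).read().decode("utf-8")
            for line in text.splitlines():
                if line.strip():
                    records.append(json.loads(line))
    return records


def make_sample_archive():
    """Build a tiny in-memory archive (hypothetical data) to try the reader."""
    record = {
        "Entities": [{"Text": "Berlin", "Type": "LOCATION", "Score": 0.99,
                      "BeginOffset": 0, "EndOffset": 6}],
        "File": "train.txt",
        "Line": 0,
    }
    payload = (json.dumps(record) + "\n").encode("utf-8")
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        info = tarfile.TarInfo(name="output")
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


records = read_comprehend_output(make_sample_archive())
```

For a real download, pass the file contents instead, for example open("output.tar.gz", "rb").read(), where the file name is whatever you saved the archive as.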

  • Step 5: Prepare the entity list (training data) for building a custom entity recognition model

The csv file for preparing training data is in UTF-8 format with the headers 'Text' and 'Type'. The values in the 'Type' column are in uppercase, as per the AWS documentation on preparing data.
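The entity list format described above can be produced with Python's csv module. This is a minimal sketch; the two example rows are hypothetical and not taken from the dataset.

```python
import csv
import io


def write_entity_list(rows, stream):
    """Write (text, type) pairs as a csv with 'Text' and 'Type' headers."""
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(["Text", "Type"])
    for text, entity_type in rows:
        # Entity types are written in uppercase.
        writer.writerow([text.strip(), entity_type.strip().upper()])


# Hypothetical example rows for illustration only.
rows = [("Bundesgerichtshof", "organization"), ("Berlin", "location")]

buffer = io.StringIO()
write_entity_list(rows, buffer)
csv_text = buffer.getvalue()
```

To save the list to disk in UTF-8, open the target file with open("entity_list.csv", "w", encoding="utf-8", newline="") and pass it as the stream.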


  • Step 6: Create a Custom Entity Recognition model.

In Amazon Comprehend, click Create new model.

Add custom entity types.


Upload the entity list as a csv file into the Amazon S3 bucket as training data.

Upload the 'german_test' csv file into the Amazon S3 bucket as the test data.

For the source of training data, select the second option, 'custom entity list'. Then select the output location for the trained data and select the custom test data that is stored in the Amazon S3 bucket.

Select the existing IAM role created for Amazon Comprehend and click Create.

The model is submitted for processing.
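For reference, the same model could be created through the API rather than the console. The sketch below assumes boto3 and its create_entity_recognizer call; every name, S3 path, entity type and ARN is a hypothetical placeholder. Note that the API expects a location for training documents alongside the entity list, and whether a separate test location is accepted together with an entity list should be checked against the AWS documentation.

```python
def build_recognizer_request(name, entity_types, entity_list_uri,
                             train_docs_uri, role_arn, test_docs_uri=None):
    """Assemble the keyword arguments for creating a custom entity recognizer."""
    documents = {"S3Uri": train_docs_uri}
    if test_docs_uri:
        # Optional customer-provided test data (assumption: allowed here).
        documents["TestS3Uri"] = test_docs_uri
    return {
        "RecognizerName": name,
        "LanguageCode": "en",
        "DataAccessRoleArn": role_arn,
        "InputDataConfig": {
            "EntityTypes": [{"Type": entity_type} for entity_type in entity_types],
            "EntityList": {"S3Uri": entity_list_uri},
            "Documents": documents,
        },
    }


def create_recognizer(request):
    """Submit the training request and return the recognizer ARN."""
    import boto3  # imported here so the request can be built without boto3

    comprehend = boto3.client("comprehend")
    response = comprehend.create_entity_recognizer(**request)
    return response["EntityRecognizerArn"]


# Hypothetical placeholder values for illustration only.
recognizer_request = build_recognizer_request(
    name="german-legal-recognizer",
    entity_types=["ORGANIZATION", "LOCATION"],
    entity_list_uri="s3://example-bucket/train/entity_list.csv",
    train_docs_uri="s3://example-bucket/train/documents.txt",
    role_arn="arn:aws:iam::123456789012:role/ExampleComprehendRole",
    test_docs_uri="s3://example-bucket/test/german_test.csv",
)
```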


Tip 1: In the training data, remove any special characters, e.g. '/'.

Tip 2: Each custom entity type must occur a minimum of 25 times in the plain text file of the entity list to be included.
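Both tips can be checked before uploading the entity list. This is a rough sketch: which characters count as 'special' is my own assumption (here, anything other than word characters, whitespace, full stops, commas and hyphens), and the threshold of 25 comes from Tip 2.

```python
import re
from collections import Counter

MIN_ENTRIES_PER_TYPE = 25  # threshold from Tip 2


def clean_text(text):
    """Replace special characters with a space and collapse repeated spaces.

    Kept as-is: word characters (letters, digits, underscore), whitespace,
    '.', ',' and '-'. This choice is an assumption, adjust it to your data.
    """
    cleaned = re.sub(r"[^\w\s.,-]", " ", text)
    return re.sub(r"\s+", " ", cleaned).strip()


def underrepresented_types(rows, minimum=MIN_ENTRIES_PER_TYPE):
    """Return {entity type: count} for types with fewer than `minimum` rows."""
    counts = Counter(entity_type for _, entity_type in rows)
    return {t: n for t, n in counts.items() if n < minimum}
```

For example, clean_text("Müller/Schmidt GmbH") turns the slash into a space, and underrepresented_types(rows) lists any entity type that needs more entries before training.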


I submitted an updated, fourth version of the model for training.

After 15 minutes, the custom entity recognition model was trained.


The model produced an F1 score of 77.30 on the training data set.
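As a reminder of what the metric measures, the F1 score is the harmonic mean of precision and recall. The sketch below uses made-up precision and recall values; the article does not report the figures behind the 77.30 score.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical values for illustration only.
example_f1 = f1_score(precision=0.5, recall=1.0)
```

Because it is a harmonic mean, the score is pulled towards the lower of the two values, so a model cannot reach a high F1 score with good recall alone.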


  • Step 7: Create a custom entity detection analysis job (asynchronous processing)

Finally, create an analysis job, selecting 'Custom entity recognition' as the analysis type and the test data in the S3 bucket as the input.

Select the output data location in the S3 bucket, select the IAM role and click Create job.

The custom entity analysis job will take a few minutes to complete.

Let's inspect the output of the custom entity recognizer analysis for legal documents stored in the S3 bucket.


Results of Custom Entity Recognition model from Entity Lists

Running the trained custom entity recognition model on the test data produced the following results.

The entities in the test data file were recognized with a 99% confidence level.

The exceptions were the last seven input texts, where the model had a lower confidence level in recognizing the German organizations from the legal documents.
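Low-confidence results like these can be picked out of the job output programmatically. This sketch assumes each output record carries an 'Entities' list whose items have 'Text', 'Type' and 'Score' fields, with the score on a 0 to 1 scale; the sample records are invented for illustration.

```python
def low_confidence_entities(records, threshold=0.99):
    """Return (line, text, type, score) for entities scoring below threshold."""
    flagged = []
    for record in records:
        for entity in record.get("Entities", []):
            if entity["Score"] < threshold:
                flagged.append(
                    (record.get("Line"), entity["Text"], entity["Type"], entity["Score"])
                )
    return flagged


# Hypothetical records for illustration only.
sample_records = [
    {"Line": 0, "Entities": [{"Text": "Berlin", "Type": "LOCATION", "Score": 0.995}]},
    {"Line": 1, "Entities": [{"Text": "Beispiel GmbH", "Type": "ORGANIZATION", "Score": 0.62}]},
    {"Line": 2, "Entities": []},
]

flagged = low_confidence_entities(sample_records)
```

Reviewing the flagged entries is a quick way to decide which entity types need more examples in the entity list.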


Method 2: Pre-requisites - Checklist before Annotating PDF Files

Pre-requisite 1: Create a virtual environment to use the latest version of Python in AWS Cloud9

  • Step 1: Create an AWS account for IAM Administrator - refer to my blog

Log in to the AWS Management Console as IAM Administrator.

  • Step 2: Create a virtual environment to use the latest version of Python in AWS Cloud9.

Type 'Cloud9' in the search bar and navigate to AWS Cloud9.

Select Create environment.


Create a name for the temporary environment, select the smallest EC2 instance type (t2.micro) and then click Create.


  • Step 3: Configure the virtual environment with your AWS credentials.

Open the Cloud9 IDE and configure your AWS environment from the terminal.

Type 'aws --version' to check the version of the AWS CLI.


Then refer to this link to configure the AWS environment.

Pre-requisite 2: Setting up the environment

  • Step 1: Install Cygwin for Windows by clicking this link


  • Step 2: Download the annotation files from GitHub

  • Step 3: Create a virtual environment in Python

In the Visual Studio Code terminal, run the following commands to install virtualenv and upgrade pip:

$ pip install virtualenv

$ pip install --upgrade pip

  • Step 4: Unzip the GitHub files


Unzip the annotation files folder downloaded from GitHub in your IDE.

I used Visual Studio Code, installing the 'AWS Toolkit' extension from Extensions in the left-hand menu.

Go to View -> Command Palette -> AWS Create Credentials:

Provide your profile details for the IAM Administrator:

AWS Access Key ID:
AWS Secret Access Key:


Pre-requisite 3: Create an Amazon S3 bucket (using the AWS Management Console) - refer to my blog.

  • Step 1: Create an S3 bucket called 'src'.


Retain the default settings and select Create bucket.

  • Step 2: Upload the training dataset in PDF format into the S3 bucket you have created.
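If there are many PDF files, the upload can be scripted instead of done through the console. This is a minimal sketch, assuming boto3 and configured AWS credentials; the 'train/' key prefix and the local folder name are hypothetical choices of mine.

```python
from pathlib import Path


def plan_uploads(local_dir, prefix="train/"):
    """Pair each PDF in local_dir with the S3 key it would be stored under."""
    return [(path, prefix + path.name) for path in sorted(Path(local_dir).glob("*.pdf"))]


def upload_pdfs(local_dir, bucket, prefix="train/"):
    """Upload every PDF in local_dir to the given S3 bucket."""
    import boto3  # imported here so plan_uploads works without boto3

    s3 = boto3.client("s3")
    for path, key in plan_uploads(local_dir, prefix):
        s3.upload_file(str(path), bucket, key)
```

For example, upload_pdfs("german_legal_pdfs", "src") would send the PDFs in a local folder named german_legal_pdfs to the 'src' bucket created in Step 1.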


Pre-requisite 4: Annotate PDF files

In the next lesson, we will annotate PDF files from the German legal documents dataset and continue with the pre-requisites from here.


AWS re:Invent 2022 keynotes, workshops and leadership sessions

In case you missed AWS re:Invent 2022 a few weeks ago, you can experience the excitement by learning about the latest innovations and product features here.


Until the next lesson, happy learning! 😁

Next Lesson: AWS re:Invent 2022 - Document AI: Classify German legal documents with Amazon Comprehend IDP - Part 2

Custom Entity Recognition Analysis with Amazon Comprehend Intelligent Document Processing (IDP) for German legal documents translated into English - for annotated PDF files labeled using Amazon SageMaker Ground Truth.
