AWS Textract

Raul ・4 min read

The idea of this post is to give a brief introduction to how the service works, especially which operations are available. This is the kind of post I wish I had before I started playing around with the service.

You can perform two types of operations in AWS Textract: synchronous and asynchronous. Which one to use depends on a few factors that we'll look at in more detail later. Independently of the SDK that you're using, the available calls to get text from a document are:

Synchronous:

  • AnalyzeDocument
  • DetectDocumentText

Asynchronous:

  • StartDocumentAnalysis
  • GetDocumentAnalysis
  • StartDocumentTextDetection
  • GetDocumentTextDetection

The same calls can also be grouped by what they extract:

Document Analysis:

  • AnalyzeDocument (sync call)
  • StartDocumentAnalysis and GetDocumentAnalysis (async)

Document Text Detection:

  • DetectDocumentText (sync call)
  • StartDocumentTextDetection and GetDocumentTextDetection (async)

Why is this important? Textract allows you to extract not only text, but also the relations between components in the document. It also distinguishes between single-page and multi-page documents. And last but not least, extracting text from images is not the same as extracting it from PDFs.
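
As a rough rule of thumb, these factors can be folded into a small helper that picks which call to start with. This is only a sketch: `pick_operation` and its parameters are names made up for this post, and it follows the constraints as this post describes them (sync calls for single-page images only). The returned strings are the boto3 method names for the operations listed above.

```python
def pick_operation(needs_tables_or_forms: bool, is_pdf: bool, multi_page: bool) -> str:
    """Return the boto3 Textract method name to call first (hypothetical helper)."""
    # PDFs and multi-page documents go through the async Start*/Get* pair.
    use_async = is_pdf or multi_page
    if needs_tables_or_forms:
        return "start_document_analysis" if use_async else "analyze_document"
    return "start_document_text_detection" if use_async else "detect_document_text"


# A single-page PNG where only the raw text matters:
print(pick_operation(needs_tables_or_forms=False, is_pdf=False, multi_page=False))
# A multi-page PDF with tables:
print(pick_operation(needs_tables_or_forms=True, is_pdf=True, multi_page=True))
```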

In this post we'll focus more on getting the text from the documents, and less on getting the relations between components. I personally see Text Detection as a subset of what Document Analysis does: Analysis gets the raw text and a bit more (table and form detection). Text Detection is also cheaper.
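
For comparison, here is a minimal sketch of what the Document Analysis side could look like. The helper names `analyze_tables_and_forms` and `count_block_types` are made up for this post, and `client` is assumed to be a `boto3.client('textract')` instance passed in by the caller. The `FeatureTypes` parameter is what asks for tables and forms on top of the raw text.

```python
def analyze_tables_and_forms(client, bucket: str, key: str) -> dict:
    """Call the sync AnalyzeDocument operation on a document stored in S3."""
    return client.analyze_document(
        Document={"S3Object": {"Bucket": bucket, "Name": key}},
        # FeatureTypes selects the extra analysis done on top of text detection.
        FeatureTypes=["TABLES", "FORMS"],
    )


def count_block_types(response: dict) -> dict:
    """Count how many blocks of each BlockType a response contains."""
    counts = {}
    for block in response["Blocks"]:
        counts[block["BlockType"]] = counts.get(block["BlockType"], 0) + 1
    return counts
```

Called as `count_block_types(analyze_tables_and_forms(text_cli, 'bucket_name', 'path/to/s3/object.png'))`, this should show the extra block types (tables, cells, key-value sets) next to the PAGE, LINE and WORD blocks that Text Detection returns.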

Sync Calls

First thing to note: at the time of writing, the sync operations can only process images (JPEG/PNG). If you need to process PDFs, you must use the async operations.

Also, the sync operations can only detect text in single-page documents. If you want to extract text from multi-page documents, again, you need to use async calls.

As you can imagine, the sync operations are blocking: your code waits until the service returns a response.

Let's look at an example with an image that we have uploaded to S3.

Let's send it to Textract.

import boto3

text_cli = boto3.client('textract')

# Pass either 'S3Object' (as here) or 'Bytes' (the raw image bytes), not both.
resp = text_cli.detect_document_text(
    Document={
        'S3Object': {
            'Bucket': 'bucket_name',
            'Name': 'path/to/s3/object.png'
        }
    }
)

You can specify either the byte representation of the image or the S3 location of your document as part of the Document parameter. In either case the result should be the same.
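
A minimal sketch of the Bytes variant, assuming the image sits on the local disk. The helper name `detect_lines_from_file` is made up for this post, and `client` is assumed to be a `boto3.client('textract')` instance. boto3 takes the raw bytes as they are, so there is no need to encode them yourself.

```python
from pathlib import Path


def detect_lines_from_file(client, path: str) -> list:
    """Send a local image to DetectDocumentText and return its lines of text."""
    image_bytes = Path(path).read_bytes()
    # 'Bytes' replaces 'S3Object' here; only one of the two is given.
    resp = client.detect_document_text(Document={"Bytes": image_bytes})
    return [block["Text"] for block in resp["Blocks"] if block["BlockType"] == "LINE"]
```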

{
  'Blocks': [{
    'BlockType': 'PAGE',
    ...
    } , {
    'BlockType': 'LINE',
    ...
    } , {
    'BlockType': 'WORD',
    ...
    }]
}

This is a subset of the response. As we can see, there is one PAGE block with no text in it (just some references to polygons and bounding boxes), and many LINE and WORD blocks. Let's see what these blocks contain.

print(">> For lines:")
for block in resp["Blocks"]:
    if block["BlockType"] == "LINE":
        print(block["Text"])

print("\n>> For words:")
for block in resp["Blocks"]:
    if block["BlockType"] == "WORD":
        print(block["Text"])

The output of the above is:

>> For lines:
VIDEOGAME
RULE #107
PRESSING
THE BUTTONS
HARDER
MAKES THE ATTACK
STRONGER

>> For words:
VIDEOGAME
RULE
#107
PRESSING
THE
BUTTONS
HARDER
MAKES
THE
ATTACK
STRONGER

The PAGE block has no Text in it; the service returns the text contained in the image both per line and per individual word.
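
The lines and words are not independent lists: each block carries an `Id`, and a LINE block points to its WORD blocks through a `Relationships` entry of type `CHILD`. The sketch below shows how the two could be tied together. The helper `words_per_line` and the sample blocks are made up for illustration (real ids are long generated strings, and real blocks carry more fields).

```python
def words_per_line(blocks: list) -> list:
    """Pair each LINE block's text with the texts of its child WORD blocks."""
    by_id = {block["Id"]: block for block in blocks}
    result = []
    for block in blocks:
        if block["BlockType"] != "LINE":
            continue
        # Collect the ids listed under every CHILD relationship of this line.
        child_ids = [
            child_id
            for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for child_id in rel["Ids"]
        ]
        result.append((block["Text"], [by_id[i]["Text"] for i in child_ids]))
    return result


# Hand-written sample mimicking a small part of a response.
sample_blocks = [
    {"Id": "p1", "BlockType": "PAGE",
     "Relationships": [{"Type": "CHILD", "Ids": ["l1", "l2"]}]},
    {"Id": "l1", "BlockType": "LINE", "Text": "RULE #107",
     "Relationships": [{"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"Id": "l2", "BlockType": "LINE", "Text": "HARDER",
     "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "RULE"},
    {"Id": "w2", "BlockType": "WORD", "Text": "#107"},
    {"Id": "w3", "BlockType": "WORD", "Text": "HARDER"},
]

for line, words in words_per_line(sample_blocks):
    print(line, "->", words)
# RULE #107 -> ['RULE', '#107']
# HARDER -> ['HARDER']
```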

Async Calls

Async calls introduce a bit more complexity to the process. They allow text extraction from PDFs and multi-page documents, and there's no need to actively wait for the response, which also makes them suitable for serverless applications. Generally, this is the recommended way to go.

An async Start operation returns only a job ID. When the job is completed, Textract publishes a notification with the job status to an SNS topic, and the results are fetched with the matching Get call.

import boto3

text_cli = boto3.client('textract')

start_dtd = text_cli.start_document_text_detection(
    DocumentLocation={
        'S3Object': {
            'Bucket': 'bucket_name',
            'Name': 'path/to/s3/object.pdf'
        }
    },
    # NotificationChannel is a top-level parameter, not part of DocumentLocation.
    NotificationChannel={
        'SNSTopicArn': 'arn::sns/topic',
        'RoleArn': 'arn::iam/role'
    }
)

This API call takes several parameters; check the documentation to see which ones fit your use case. The SNS topic can trigger a Lambda function that processes the result, or it can feed an SQS queue.
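
For the Lambda route, a handler could look roughly like the sketch below. It assumes the standard SNS-to-Lambda event shape and that the notification message is a JSON document with `JobId` and `Status` fields, which is my understanding of the format; double-check it against the documentation before relying on it. The names `parse_textract_notification` and `handler` are made up for this post.

```python
import json


def parse_textract_notification(event: dict) -> list:
    """Extract job id and status from every SNS record in a Lambda event."""
    jobs = []
    for record in event["Records"]:
        # The SNS message body arrives as a JSON string.
        message = json.loads(record["Sns"]["Message"])
        jobs.append({"job_id": message["JobId"], "status": message["Status"]})
    return jobs


def handler(event, context):
    """Hypothetical Lambda entry point: report which jobs are ready to fetch."""
    for job in parse_textract_notification(event):
        if job["status"] == "SUCCEEDED":
            print("Job ready to fetch:", job["job_id"])
        else:
            print("Job did not succeed:", job["job_id"], job["status"])
```

From there, the job id can be passed to get_document_text_detection to collect the blocks.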

In this example we'll simply poll with the JobId until the job is completed.

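import time

kwargs = {"JobId": start_dtd["JobId"]}

# Poll until the job leaves IN_PROGRESS; stopping only on SUCCEEDED
# would loop forever if the job fails.
while True:
    get_dtd = text_cli.get_document_text_detection(**kwargs)
    print(get_dtd["JobStatus"] + "...")
    if get_dtd["JobStatus"] != "IN_PROGRESS":
        break
    time.sleep(5)

if get_dtd["JobStatus"] != "SUCCEEDED":
    raise RuntimeError("Textract job ended as " + get_dtd["JobStatus"])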
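
If you poll like this in real code, you probably also want an upper bound on the waiting time. A possible sketch, wrapped in a function: `wait_for_job` and its parameters are made up for this post, `client` is assumed to be a `boto3.client('textract')` instance, and the `sleep` and `clock` arguments exist only so the function can be tried without actually waiting.

```python
import time


def wait_for_job(client, job_id, delay=5.0, timeout=600.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll a text detection job until it leaves IN_PROGRESS or the timeout expires."""
    deadline = clock() + timeout
    while True:
        resp = client.get_document_text_detection(JobId=job_id)
        status = resp["JobStatus"]
        if status != "IN_PROGRESS":
            # SUCCEEDED, FAILED or PARTIAL_SUCCESS: let the caller decide what to do.
            return status
        if clock() >= deadline:
            raise TimeoutError(f"Textract job {job_id} still running after {timeout} seconds")
        sleep(delay)
```

The function returns the final status instead of raising on failure, so the caller can handle a failed job in its own way.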

Once the job is done, we'll iterate over the pages of the response (it is paginated) and collect the blocks in order to get all of the text.

has_token = True
blocks = []
while has_token:
    get_dtd = text_cli.get_document_text_detection(**kwargs)
    blocks += get_dtd["Blocks"]

    if "NextToken" in get_dtd:
        kwargs["NextToken"] = get_dtd["NextToken"]
    else:
        has_token = False

And once we've got all the blocks, we can print the text in them just as we did with the sync call.

print(">> For lines:")
for block in blocks:
    if block["BlockType"] == "LINE":
        print(block["Text"])

print("\n>> For words:")
for block in blocks:
    if block["BlockType"] == "WORD":
        print(block["Text"])
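
With multi-page documents it is often handy to keep the text separated per document page. The sketch below repeats the NextToken loop as a generator and groups the LINE blocks by the `Page` number that each block carries. The helper names `iter_blocks` and `lines_by_page` are made up for this post, and `client` is assumed to be a `boto3.client('textract')` instance.

```python
from collections import defaultdict


def iter_blocks(client, job_id):
    """Yield every block of a finished job, following NextToken until it runs out."""
    kwargs = {"JobId": job_id}
    while True:
        resp = client.get_document_text_detection(**kwargs)
        yield from resp.get("Blocks", [])
        token = resp.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token


def lines_by_page(blocks) -> dict:
    """Group the text of LINE blocks by the document page they belong to."""
    pages = defaultdict(list)
    for block in blocks:
        if block["BlockType"] == "LINE":
            # Fall back to page 1 if a block has no Page field.
            pages[block.get("Page", 1)].append(block["Text"])
    return dict(pages)
```

Something like `lines_by_page(iter_blocks(text_cli, start_dtd["JobId"]))` then gives a dictionary that maps each page number to its lines.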

Last Considerations

There are a lot of different use cases where Textract can fit in your architecture. Here is an example of good practices in a large system: https://github.com/aws-samples/amazon-textract-serverless-large-scale-document-processing

If you are thinking about using it at a large scale, you might want to look at the default limits, which are quite low once you start taking it seriously. If you need the limits increased, I'd recommend being patient and starting the process as soon as possible. It took us more than a month and A LOT of support cases and emails (and time and effort) to get acceptable limits in our accounts.
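
While you wait for higher limits, one way to live with the low ones is to retry throttled calls with exponential backoff. The sketch below is an assumption-heavy illustration, not an official recipe: `call_with_backoff` is a made-up helper, the error codes in `THROTTLE_CODES` are the ones I believe Textract uses when limits are hit (check the documentation), and the error code is read from the `response` attribute that botocore's ClientError carries.

```python
import random
import time

# Error codes assumed to signal throttling or exceeded limits.
THROTTLE_CODES = {
    "ThrottlingException",
    "ProvisionedThroughputExceededException",
    "LimitExceededException",
}


def call_with_backoff(func, *args, max_attempts=5, base_delay=1.0,
                      sleep=time.sleep, **kwargs):
    """Call func, retrying with exponential backoff while it is being throttled."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            # botocore's ClientError exposes the service error in exc.response.
            code = getattr(exc, "response", {}).get("Error", {}).get("Code")
            if code not in THROTTLE_CODES or attempt == max_attempts:
                raise
            # Wait 1x, 2x, 4x... the base delay, plus some random jitter.
            sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, base_delay))
```

A call would then look like `call_with_backoff(text_cli.detect_document_text, Document={...})`. boto3 also has its own configurable retry behaviour, which may be enough on its own for lighter workloads.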

There are many more use cases than the one shown in this post. With the right libraries you can, for example, crop and store images and PDFs according to the text inside. Take a look at the official documentation.
