
Mixpeek


Search text from PDF files stored in an S3 bucket

Does your application allow users to upload PDFs? Maybe they upload resumes, waivers, agreements or signed documents. What if they need to search the contents of these PDFs?

As a developer, you have three options:

  1. Search by Filename: Look up objects by key, such as the filename [Native]
  2. Search by Metadata: Store the metadata in a separate database and query it [Database add-on]
  3. Full-Text Search: Extract the contents into a search engine [OCR, Database, Search add-on]

Full-text search provides the most intuitive user experience, but it’s also the most challenging to build, maintain, and enhance.

data diagram

In this tutorial, we’ll walk you through best practices for PDF file upload, content extraction via OCR (Optical Character Recognition), and searching, so you can add full-text PDF search to your application with ease.

Bonus: A GitHub repository is linked at the end so you can import the code directly into your application.

Store the file

First we need a function to download the file locally in order to run our OCR extraction logic:

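import boto3

s3_client = boto3.client(
    's3',
    aws_access_key_id='aws_access_key_id',
    aws_secret_access_key='aws_secret_access_key',
    region_name='region_name'
)

# Placeholders: replace with your bucket name and the object key of the PDF
bucket_name = 'bucket_name'
s3_file_name = 's3_file_name'

# Stream the S3 object into a local file with the same name
with open(s3_file_name, 'wb') as file:
    s3_client.download_fileobj(
        bucket_name,
        s3_file_name,
        file
    )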
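The text asks for a function, while the snippet above runs inline. One way to package it is sketched below. The name `download_pdf` and its parameters are hypothetical, and the S3 client is passed in rather than created inside the function. The sketch also writes to a temporary directory and keeps only the last part of the key, on the assumption that your keys may contain slashes (for example `resumes/jane.pdf`), which would otherwise make `open()` fail.

```python
import os
import tempfile


def download_pdf(s3_client, bucket_name, s3_file_name, dest_dir=None):
    """Download one S3 object to a local file and return the local path.

    s3_client is any object exposing boto3's
    download_fileobj(bucket, key, fileobj) method.
    """
    # Assumption: a fresh temp directory per download is acceptable.
    dest_dir = dest_dir or tempfile.mkdtemp(prefix="pdf-search-")
    # Keys may contain "/"; keep only the final component so open()
    # does not need intermediate directories.
    local_path = os.path.join(dest_dir, os.path.basename(s3_file_name))
    with open(local_path, "wb") as file:
        s3_client.download_fileobj(bucket_name, s3_file_name, file)
    return local_path
```

With the client created earlier, this would be called as `local_path = download_pdf(s3_client, bucket_name, s3_file_name)`, and the returned path is what gets handed to the extraction step.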

Extract the contents

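We’ll use the open-source Apache Tika library through the tika Python package. Tika’s AutoDetectParser class detects the file type and extracts its text; OCR (optical character recognition) of scanned pages additionally depends on Tesseract being installed where Tika runs: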

from tika import parser
parsed_pdf_content = parser.from_file(s3_file_name)['content']
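The `content` value returned by `parser.from_file` can be `None` when Tika finds no text (for example, an image-only scan with no OCR available), and extracted text usually carries long runs of blank lines. A minimal sketch of a cleanup step follows; `clean_extracted_text` is a hypothetical helper, not part of Tika.

```python
import re


def clean_extracted_text(parsed):
    """Return whitespace-normalized text from a Tika result dict, or "".

    `parsed` is the dict returned by tika's parser.from_file(...).
    """
    content = (parsed or {}).get("content")
    if not content:
        # No text was extracted, e.g. an image-only scan without OCR.
        return ""
    # Collapse runs of spaces, tabs, and newlines into single spaces.
    return re.sub(r"\s+", " ", content).strip()
```

Usage would look like `parsed_pdf_content = clean_extracted_text(parser.from_file(s3_file_name))`. An empty result is a useful signal that the file needs OCR before it is worth indexing.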

Insert contents into a search engine

We’re using a self-managed OpenSearch node here, but you can use Lucene, Solr, Elasticsearch, or Atlas Search.

Note: if you don’t have OpenSearch installed locally, install it first (shown here with Homebrew), then run it:

brew update  
brew install opensearch  
opensearch

OpenSearch will now be accessible at http://localhost:9200. Let’s build the index and insert the file contents:

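from opensearchpy import OpenSearch

# Named "client" rather than "os" to avoid shadowing Python's os module
client = OpenSearch("http://localhost:9200/")

index_name = "pdf-search"

doc = {
    "filename": s3_file_name,
    "parsed_pdf_content": parsed_pdf_content
}

# refresh=True makes the document searchable immediately
response = client.index(
    index=index_name,
    body=doc,
    id=1,
    refresh=True
)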
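Because the snippet above hardcodes `id=1`, indexing a second PDF would overwrite the first. A sketch of one alternative is below: derive a stable ID from the bucket and key, and create the index with an explicit mapping first. The names `INDEX_BODY`, `document_id`, and `index_pdf` are hypothetical, and the mapping (a `keyword` filename and a `text` body) is an assumption about how you want to query. The client is assumed to be an opensearch-py `OpenSearch` instance.

```python
import hashlib

# Assumed mapping: exact-match filenames, analyzed full-text content.
INDEX_BODY = {
    "mappings": {
        "properties": {
            "filename": {"type": "keyword"},
            "parsed_pdf_content": {"type": "text"},
        }
    }
}


def document_id(bucket_name, s3_file_name):
    """Stable ID per S3 object, so re-indexing a file replaces only itself."""
    raw = f"{bucket_name}/{s3_file_name}".encode("utf-8")
    return hashlib.sha1(raw).hexdigest()


def index_pdf(client, index_name, bucket_name, s3_file_name, parsed_pdf_content):
    """Create the index if it is missing, then index one PDF's text."""
    if not client.indices.exists(index=index_name):
        client.indices.create(index=index_name, body=INDEX_BODY)
    doc = {"filename": s3_file_name, "parsed_pdf_content": parsed_pdf_content}
    return client.index(
        index=index_name,
        body=doc,
        id=document_id(bucket_name, s3_file_name),
        refresh=True,
    )
```

Uploading a new version of the same key then updates the existing document instead of creating a duplicate, while different files no longer collide.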

Creating a PDF search API

We’ll use Flask to create a microservice that searches for terms:

from flask import Flask, jsonify, request
from opensearchpy import OpenSearch
from config import *  # expected to define index_name

app = Flask(__name__)
client = OpenSearch("http://localhost:9200/")


@app.route('/search', methods=['GET'])
def search_file():
    query = request.args.get('q', default=None, type=str)

    # query payload sent to OpenSearch
    payload = {
        'query': {
            'match': {
                'parsed_pdf_content': query
            }
        }
    }

    response = client.search(
        body=payload,
        index=index_name
    )

    return jsonify(response)


if __name__ == '__main__':
    app.run(host="localhost", port=5011, debug=True)
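If a request arrives without `q`, the handler above passes `None` straight into the `match` clause. A small sketch of validating the term before querying is shown here; `build_search_payload` and the `size` default of 10 are hypothetical choices, not part of the original service.

```python
def build_search_payload(query, size=10):
    """Return a match-query body, or raise ValueError for an empty query."""
    if query is None or not query.strip():
        raise ValueError("query parameter 'q' is required")
    return {
        "size": size,  # cap on the number of hits returned
        "query": {"match": {"parsed_pdf_content": query.strip()}},
    }
```

Inside `search_file`, the `ValueError` could be caught and turned into a 400 response, for example `return jsonify({"error": str(err)}), 400`, so callers get a clear message instead of a search-engine error.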

Now we can call the API via:

GET: http://localhost:5011/search?q=SEARCH_TERM
{
  "_shards": {
    "failed": 0,
    "skipped": 0,
    "successful": 1,
    "total": 1
  },
  "hits": {
    "hits": [
      {
        "_id": "1",
        "_index": "pdf-search",
        "_score": 0.29289162,
        "_source": {
          "filename": "prescription.pdf",
          "parsed_pdf_content": "http://localhost:5011/search?q=SEARCH_TERM"
        }
      }
    ],
    "max_score": 0.29289162,
    "total": {
      "relation": "eq",
      "value": 1
    }
  },
  "timed_out": false,
  "took": 40
}
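The raw response returns the full `parsed_pdf_content` of every matching PDF, which can be large. As a sketch, the API could reduce each hit to the filename and score before replying. `summarize_hits` is a hypothetical helper that assumes the response shape shown above.

```python
def summarize_hits(response):
    """Reduce an OpenSearch search response to filename/score pairs."""
    hits = response.get("hits", {}).get("hits", [])
    return [
        {"filename": hit["_source"]["filename"], "score": hit["_score"]}
        for hit in hits
    ]
```

For the sample response above, this yields a single entry for `prescription.pdf` with its score of 0.29289162.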

Whoo, we did it! We’ve successfully created an API that offers full-text PDF search.


You can download the repo here: https://github.com/mixpeek/pdf-search-s3

So what’s next?

  • Queuing: Ensuring concurrent file uploads are not dropped
  • Security: Adding end-to-end encryption to the data pipeline
  • Enhancements: Including more features like fuzzy matching, highlighting, and autocomplete
  • Rate Limiting: Building thresholds so users don’t abuse the system
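To give a flavor of the enhancements item, here is a sketch of a query body that adds typo tolerance and highlighted snippets using standard OpenSearch query DSL options. The function name `build_enhanced_payload` and the 150-character fragment size are hypothetical choices.

```python
def build_enhanced_payload(query, fragment_size=150):
    """Match query with typo tolerance plus highlighted snippets."""
    return {
        "query": {
            "match": {
                "parsed_pdf_content": {
                    "query": query,
                    "fuzziness": "AUTO",  # tolerate small typos
                }
            }
        },
        "highlight": {
            "fields": {
                "parsed_pdf_content": {"fragment_size": fragment_size}
            }
        },
        # Return only the filename; the highlights carry the context.
        "_source": ["filename"],
    }
```

Passing this body to the search call in place of the plain `match` payload would let a misspelled term still match, with each hit carrying a `highlight` section containing the surrounding text.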

Everything collapsed into just 2 API calls

If this feels like too much for you to build, maintain, and enhance, Mixpeek has you covered.

Upload

import requests

url = "https://api.mixpeek.com/upload"
files = [
    ('file', ('FILE_NAME.pdf', open('FILE_NAME.pdf', 'rb'), 'pdf'))
]
response = requests.request("POST", url, files=files)

Search

import requests
url = "https://api.mixpeek.com/search?q=SEARCH_QUERY"
response = requests.request("GET", url)
print(response.text)
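Search terms that contain spaces or characters such as `&` need to be URL-encoded before they go into the query string. A minimal sketch using only the standard library follows; `search_url` is a hypothetical helper, and the same approach applies to the local Flask endpoint.

```python
from urllib.parse import urlencode


def search_url(base_url, search_query):
    """Build a search URL with the query safely percent-encoded."""
    return f"{base_url}?{urlencode({'q': search_query})}"
```

When using requests, `requests.get(base_url, params={"q": search_query})` performs the same encoding for you.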

Corresponding Postman Collection for your convenience.

Request an API key for free, and review the docs to get started.
