Rahul Jain
Applying LLM to build Digital Medical Record System: From Paper to Structured Data

Introduction

In today's digital age, the healthcare industry is still grappling with the challenge of converting paper records into structured, easily accessible digital data. This article will guide you through building a comprehensive digital medical record system that scans documents, extracts relevant information, and stores it in a structured format. We'll cover the entire process, from backend development to frontend design, and discuss future improvements.

Base problem: How to convert a medical report in PDF format into structured data

The solution has two major aspects. The first is converting PDFs/images to text content, which is largely solved using OCR or parser libraries like langchain.document_loaders.parsers or unstructured. These tools are highly effective at extracting text from a variety of document formats, ensuring that the content is accurately captured from scanned images or PDF files. By utilizing these libraries, we can handle a wide range of document types, from medical reports to handwritten notes, and convert them into machine-readable text. These tools are generally accurate enough that minimal post-processing is required, allowing us to focus on the next critical step.

The second aspect is converting the unstructured text into structured data, which is a more complex challenge. For this, we'll leverage the power of Large Language Models (LLMs). These models can understand and process natural language, enabling us to extract relevant information and organize it into a structured format. LLMs are particularly adept at identifying key entities, relationships, and data points within the text, such as patient names, dates, medical terms, and diagnostic information. By using LLMs, we can automate the process of data structuring, making it faster and more accurate than manual methods. This automation not only reduces the workload on healthcare professionals but also minimizes the risk of human error, ensuring that the structured data is reliable and consistent.

This two-pronged approach addresses both the technical and practical challenges of digitizing medical records, paving the way for improved data management and better healthcare outcomes.

Step 1: Scanning the document and extracting all text data

We'll leverage Langchain parsers for text extraction from scanned documents. Langchain offers a variety of parsers that can handle different document formats, ensuring accurate text extraction. This functionality is crucial for converting scanned medical reports into machine-readable text, making the subsequent text processing steps more efficient and reliable.

from typing import BinaryIO

from langchain.document_loaders.parsers import BS4HTMLParser, PDFMinerParser
from langchain.document_loaders.parsers.generic import MimeTypeBasedParser
from langchain.document_loaders.parsers.txt import TextParser
from langchain_community.document_loaders import Blob

# Map each supported MIME type to the parser that handles it.
HANDLERS = {
    "application/pdf": PDFMinerParser(),
    "text/plain": TextParser(),
    "text/html": BS4HTMLParser(),
}

SUPPORTED_MIMETYPES = sorted(HANDLERS.keys())

# Dispatch to the right parser based on the blob's MIME type.
MIMETYPE_BASED_PARSER = MimeTypeBasedParser(handlers=HANDLERS)

def convert_binary_input_to_blob(data: BinaryIO) -> Blob:
    """Wrap an open binary file in a Blob with its guessed MIME type."""
    file_data = data.read()
    mimetype = _guess_mimetype(file_data)  # helper that sniffs the file's leading bytes
    file_name = data.name

    return Blob.from_data(
        data=file_data,
        path=file_name,
        mime_type=mimetype,
    )

with open(file_name, "rb") as f:
    blob = convert_binary_input_to_blob(f)
    parsed_doc = MIMETYPE_BASED_PARSER.parse(blob)

Step 2 : Text Processing with LLMs

We'll first create a flexible system that allows users to choose between different LLMs based on their API keys.

import os

from langchain_anthropic import ChatAnthropic
from langchain_fireworks import ChatFireworks
from langchain_groq import ChatGroq
from langchain_openai import ChatOpenAI

def get_supported_models():
    """Get models according to environment secrets."""
    models = {}
    if "OPENAI_API_KEY" in os.environ:
        models["gpt-3.5-turbo"] = {
            "chat_model": ChatOpenAI(model="gpt-3.5-turbo", temperature=0),
            "description": "GPT-3.5 Turbo",
        }
        models["gpt-4o"] = {
            "chat_model": ChatOpenAI(model="gpt-4o", temperature=0),
            "description": "GPT-4o",
        }
    if "FIREWORKS_API_KEY" in os.environ:
        models["fireworks"] = {
            "chat_model": ChatFireworks(
                model="accounts/fireworks/models/firefunction-v1",
                temperature=0,
            ),
            "description": "Fireworks Firefunction-v1",
        }
    if "TOGETHER_API_KEY" in os.environ:
        models["together-ai-mistral-8x7b-instruct-v0.1"] = {
            "chat_model": ChatOpenAI(
                base_url="https://api.together.xyz/v1",
                api_key=os.environ["TOGETHER_API_KEY"],
                model="mistralai/Mixtral-8x7B-Instruct-v0.1",
                temperature=0,
            ),
            "description": "Mixtral 8x7B Instruct v0.1 (Together AI)",
        }
    if "ANTHROPIC_API_KEY" in os.environ:
        models["claude-3-sonnet-20240229"] = {
            "chat_model": ChatAnthropic(
                model="claude-3-sonnet-20240229", temperature=0
            ),
            "description": "Claude 3 Sonnet",
        }
    if "GROQ_API_KEY" in os.environ:
        models["groq-llama3-8b-8192"] = {
            "chat_model": ChatGroq(
                model="llama3-8b-8192",
                temperature=0,
            ),
            "description": "GROQ Llama 3 8B",
        }
    return models
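A design note on the registry above: because availability is driven purely by environment variables, the gating logic can be unit-tested without any provider SDKs by injecting an environment dict. A minimal sketch (the `available_model_names` helper is hypothetical, with model construction stubbed out):

```python
import os

def available_model_names(env=None):
    """Return model ids whose provider API key is present in the environment."""
    env = os.environ if env is None else env
    names = []
    if "OPENAI_API_KEY" in env:
        names += ["gpt-3.5-turbo", "gpt-4o"]
    if "ANTHROPIC_API_KEY" in env:
        names += ["claude-3-sonnet-20240229"]
    if "GROQ_API_KEY" in env:
        names += ["groq-llama3-8b-8192"]
    return names

print(available_model_names({"OPENAI_API_KEY": "sk-test"}))  # → ['gpt-3.5-turbo', 'gpt-4o']
```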

Next, create the schema in which the information should be structured. We'll use JSON Schema, since it lets us provide detailed information about each field.

{
    "$schema": "http://json-schema.org/draft-07/schema#",
    "title": "Medical Information Extractor",
    "description": "Schema for extracting patient and test information from text.",
    "type": "object",
    "properties": {
        "patient_name": {
            "type": "string",
            "title": "Patient Name",
            "description": "The name of the patient."
        },
        "age": {
            "type": "integer",
            "title": "Age",
            "description": "The age of the patient."
        },
        "date_of_birth": {
            "type": "string",
            "title": "Date of Birth",
            "description": "The date of birth of the patient."
        },
        "doctor_name": {
            "type": "string",
            "title": "Doctor Name",
            "description": "The name of the doctor treating the patient."
        },
        "date": {
            "type": "string",
            "title": "Date",
            "description": "The date of the medical record."
        },
        "tests": {
            "type": "array",
            "title": "List of Tests",
            "description": "List of tests conducted for the patient.",
            "items": {
                "type": "object",
                "properties": {
                    "test_name": {
                        "type": "string",
                        "title": "Test Name",
                        "description": "The name of the test conducted."
                    },
                    "markers": {
                        "type": "array",
                        "title": "List of markers",
                        "description": "List of markers calculated for the test.",
                        "items": {
                            "type": "object",
                            "properties": {
                                "marker_name": {
                                    "type": "string",
                                    "title": "Marker Name",
                                    "description": "The name of the marker measured."
                                },
                                "normal_range": {
                                    "type": "object",
                                    "properties": {
                                        "min": {
                                            "type": "number",
                                            "title": "Minimum Value of normal range"
                                        },
                                        "max": {
                                            "type": "number",
                                            "title": "Maximum Value of normal range"
                                        }
                                    },
                                    "description": "The normal range of the parameter."
                                },
                                "current_value": {
                                    "type": "number",
                                    "title": "Current Value",
                                    "description": "The current value of the parameter."
                                }
                            },
                            "required": ["marker_name", "current_value"]
                        }
                    }
                },
                "required": ["test_name", "markers"]
            }
        }
    },
    "required": [
        "patient_name",
        "age",
        "date_of_birth",
        "doctor_name",
        "date",
        "tests"
    ]
}
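A quick sanity check that a sample extraction carries the schema's required top-level fields can be done with no dependencies. This is a minimal sketch with a hypothetical `check_required` helper; for real validation use the jsonschema package, which the extraction endpoint later in this article also relies on:

```python
def check_required(record: dict, schema: dict) -> list:
    """Return the names of required top-level fields missing from a record."""
    return [field for field in schema.get("required", []) if field not in record]

schema = {"required": ["patient_name", "age", "date_of_birth", "doctor_name", "date", "tests"]}
record = {"patient_name": "John Doe", "age": 45, "tests": []}
print(check_required(record, schema))  # → ['date_of_birth', 'doctor_name', 'date']
```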

Prompt Generation

Create a detailed prompt for the model to extract specific information from the text. To enhance the model’s performance and accuracy, include clear and precise instructions within the prompt. Additionally, it is beneficial to provide some illustrative examples that demonstrate the desired outcome. These examples will serve as a guide for the model, helping it to understand exactly what information to look for and how to present it. By combining detailed instructions with relevant examples, you can significantly improve the efficiency and effectiveness of the model’s information extraction capabilities.

Few-Shot Learning Explanation

Few-shot learning is a technique used in machine learning where the model is trained to perform a task by being given only a few examples. This is in contrast to traditional machine learning methods that require large amounts of data to achieve high performance. In the context of prompt creation for information extraction, few-shot learning involves providing the model with a handful of examples of the task at hand.

Here’s how few-shot learning works in this scenario:

  1. Instructions: Begin with a set of clear and concise instructions that guide the model on what to extract. These instructions should be specific to the type of information you need from the text.
  2. Examples: Provide a few examples that illustrate the type of text the model will process and the expected output. These examples help the model understand the structure and format of the information it needs to extract.
  3. Pattern Recognition: The model uses these instructions and examples to recognize patterns in the text. By learning from the few provided examples, it can generalize this knowledge to new, unseen text.

Example of Few-Shot Learning in a Prompt

import uuid

from langchain_core.messages import AIMessage, HumanMessage, ToolMessage
from langchain_core.prompts import ChatPromptTemplate

# Name of the structured-output tool the few-shot examples pretend to call
# (must match the function/tool name used for structured output).
function_name = "extractor"

def create_extraction_prompt(instructions: str, examples: list, content: str) -> ChatPromptTemplate:
    prefix = f"You are a top-tier algorithm for extracting information from medical text. {instructions}\n\n"
    prompt_components = [("system", prefix)]

    if examples is not None:
        few_shot_prompt = []
        for example in examples:
            _id = uuid.uuid4().hex
            tool_call = {
                "args": {"data": example["output"]},
                "name": function_name,
                "id": _id,
            }
            few_shot_prompt.extend(
                [
                    HumanMessage(content=example["input"]),
                    AIMessage(content="", tool_calls=[tool_call]),
                    ToolMessage(
                        content="You have correctly called this tool.", tool_call_id=_id
                    ),
                ]
            )
        prompt_components.extend(few_shot_prompt)

    prompt_components.append(
        (
            "human",
            "I need to extract information from the following text:\n```\n{text}\n```\n",
        ),
    )
    return ChatPromptTemplate.from_messages(prompt_components)

# Instructions for the model
instructions = (
    "The documents will be lab test reports. "
    "The document might have the header and footer repeated multiple times; "
    "ignore these repetitions. "
    "The table's header will be repeated multiple times; ignore that as well. "
    "While ignoring a repeated table header, put its markers in the previous test. "
    "Only extract information that is relevant to the provided text. "
    "If no information is relevant, use the schema and output "
    "an empty list where appropriate."
)

# Examples to guide the model
examples = [
    {
        "input": "Patient: John Doe\nAge: 45\nTest: Blood Test\nMarker: Hemoglobin\nValue: 13.5 g/dL\n",
        "output": {
            "patient_name": "John Doe",
            "age": 45,
            "tests": [
                {
                    "test_name": "Blood Test",
                    "markers": [
                        {
                            "marker_name": "Hemoglobin",
                            "current_value": 13.5,
                            "unit": "g/dL"
                        }
                    ]
                }
            ]
        }
    },
    {
        "input": "Patient: Jane Smith\nDOB: 1980-05-12\nTest: Cholesterol\nMarker: LDL\nValue: 120 mg/dL\n",
        "output": {
            "patient_name": "Jane Smith",
            "date_of_birth": "1980-05-12",
            "tests": [
                {
                    "test_name": "Cholesterol",
                    "markers": [
                        {
                            "marker_name": "LDL",
                            "current_value": 120,
                            "unit": "mg/dL"
                        }
                    ]
                }
            ]
        }
    }
]

# Content for the model to process
content = "Patient: Alice Brown\nAge: 62\nTest: Glucose\nMarker: Fasting Blood Sugar\nValue: 95 mg/dL\n"

# Create the prompt
prompt = create_extraction_prompt(instructions, examples, content)
print(prompt)


Now, we need to create a model chain

Introducing two new concepts here: the first is converting your custom logic into a runnable using the @chain decorator provided by Langchain. This decorator allows you to seamlessly integrate your custom code into a reusable and executable format. The second is Langchain's chaining mechanism, which uses LCEL (LangChain Expression Language) constructs such as prompt | preprocessing | model | postprocessor, enabling a streamlined flow where the initial prompt is processed, run through a model, and then post-processed. This chaining mechanism keeps each step modular and easy to manage or modify, providing flexibility and efficiency when executing complex logic.

from fastapi import HTTPException
from jsonschema import Draft202012Validator, exceptions
from langchain_core.runnables import chain

@chain
async def extraction_runnable(extraction_request: ExtractRequest) -> ExtractResponse:
    """An endpoint to extract content from a given text object."""
    schema = get_schema()  # returns the JSON schema defined earlier
    try:
        Draft202012Validator.check_schema(schema)
    except exceptions.ValidationError as e:
        raise HTTPException(status_code=422, detail=f"Invalid schema: {e.message}")

    prompt = ...  # Defined in previous step
    model = get_model(extraction_request.model_name)
    runnable = (prompt | model.with_structured_output(schema=schema)).with_config(
        {"run_name": "extraction"}
    )

    return await runnable.ainvoke({"text": extraction_request.text})

Additionally, since the model's context window is small compared to a large document, the code below processes the document in chunks.

from typing import List

from langchain_text_splitters import TokenTextSplitter

async def extract_entire_document(
    content: str,
    document_type: str,
    model_name: str,
) -> ExtractResponse:
    """Extract from entire document."""

    json_schema = ... # Generate schema of extracted data
    text_splitter = TokenTextSplitter(
        chunk_size=get_chunk_size(model_name),
        chunk_overlap=20,
        model_name=model_name,
    )
    texts = text_splitter.split_text(content)
    extraction_requests = [
        ExtractRequest(
            text=text,
            schema=json_schema,
            model_name=model_name,
            document_type=document_type,
        )
        for text in texts
    ]

    # Limit the number of chunks to process
    if len(extraction_requests) > settings.MAX_CHUNKS and settings.MAX_CHUNKS > 0:
        content_too_long = True
        extraction_requests = extraction_requests[: settings.MAX_CHUNKS]
    else:
        content_too_long = False

    # Run extractions, which may yield duplicate results across chunks
    logger.info(f"Extracting document in {len(extraction_requests)} batches")
    extract_responses: List[ExtractResponse] = await extraction_runnable.abatch(
        extraction_requests, {"max_concurrency": settings.MAX_CONCURRENCY}
    )
    # Deduplicate the results
    return {
        "data": deduplicate(extract_responses)["data"],
        "content_too_long": content_too_long,
    }
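The `deduplicate` helper referenced above is not shown; a minimal sketch (an assumption, presuming each response carries a `data` list of extracted records) that merges chunk results and drops exact duplicates:

```python
import json

def deduplicate(responses):
    """Merge chunked extraction responses, keeping the first copy of each record."""
    seen, merged = set(), []
    for response in responses:
        for record in response.get("data", []):
            key = json.dumps(record, sort_keys=True)  # canonical form for comparison
            if key not in seen:
                seen.add(key)
                merged.append(record)
    return {"data": merged}

chunks = [
    {"data": [{"patient_name": "John Doe", "age": 45}]},
    {"data": [{"patient_name": "John Doe", "age": 45}, {"patient_name": "Jane Smith"}]},
]
print(deduplicate(chunks))  # only the two unique records survive
```

Exact-match deduplication is the simplest policy; records that differ only slightly (for example, the same marker with rounded values from overlapping chunks) would need fuzzier merging.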

The rest is standard engineering: store the structured information in a database.
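As one concrete option (a sketch, not the repo's actual storage layer), the structured JSON can be persisted in SQLite with the full record kept alongside a few queryable columns:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path in practice
conn.execute(
    """CREATE TABLE IF NOT EXISTS medical_records (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        patient_name TEXT,
        record_date TEXT,
        data TEXT  -- full structured extraction as JSON
    )"""
)

record = {"patient_name": "John Doe", "date": "2024-05-01", "tests": []}
conn.execute(
    "INSERT INTO medical_records (patient_name, record_date, data) VALUES (?, ?, ?)",
    (record["patient_name"], record["date"], json.dumps(record)),
)

row = conn.execute(
    "SELECT data FROM medical_records WHERE patient_name = ?", ("John Doe",)
).fetchone()
print(json.loads(row[0])["date"])  # → 2024-05-01
```

Keeping the raw JSON in a single column preserves the full extraction, while the promoted columns support indexing and search.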

The complete source code with Frontend and docker compose files is available on github: https://github.com/rahuljainz/medical-records-AI

Sample UI

Future Improvements:

  1. Incorporate Open-Source Models: Integrate open-source LLMs like FLAN-T5 or BART to reduce dependency on commercial APIs.
  2. Fine-tune NER Models: Develop and fine-tune Named Entity Recognition (NER) models specifically for medical terminology to improve data extraction accuracy.
  3. Implement Privacy Measures: Enhance data security and privacy compliance with encryption and access controls.
  4. Mobile Application: Develop a mobile app for on-the-go access to medical records.
  5. AI-Powered Health Insights: Implement AI algorithms to provide personalized health insights based on biomarker trends.

Conclusion:

Building a digital medical record system is a complex but rewarding project. By following this guide, you can create a powerful tool that streamlines record-keeping and provides valuable health insights. As technology evolves, continual improvement and adaptation will ensure your system remains cutting-edge and beneficial to users.

Remember, when dealing with medical data, always prioritize privacy, security, and compliance with relevant healthcare regulations.
