LlamaParse Integration

LlamaParse is a document parsing service that extracts text and structure from PDF lab reports while preserving formatting such as tables and headings.

Overview

MedMitra uses LlamaParse for:

Lab Report Parsing: Extract text from PDF medical reports
Structure Preservation: Maintain tables, headers, and formatting
Multi-page Processing: Handle complex medical documents
Markdown Output: Clean, structured text output

Why LlamaParse?

Medical Document Optimized: Better than generic PDF parsers
Table Extraction: Preserves lab values in table format
Fast Processing: Async processing for multiple pages
Reliable: Handles various PDF formats and layouts

Prerequisites

A LlamaCloud account (sign up at cloud.llamaindex.ai)
Python 3.9+ for backend integration

Setup Instructions

1. Get a LlamaParse API Key

Visit cloud.llamaindex.ai
Sign up or log in to your account
Navigate to API Keys section
Click Generate API Key
Copy your API key

LlamaParse offers a free tier with limited credits. Monitor your usage in the dashboard.

2. Configure Environment Variables

Add to backend/.env:

LLAMAPARSE_API_KEY="llx_your_llamaparse_api_key_here"
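The parser module shown later imports LLAMAPARSE_API_KEY from a config module that this page does not include. The following is a minimal sketch of how such a module might read and validate the key, assuming the .env file has already been loaded into the process environment. The helper name load_llamaparse_key and the quote-stripping step are illustrative assumptions, not MedMitra code.

```python
import os
from typing import Mapping, Optional


def load_llamaparse_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the LlamaParse API key, or raise if it is missing.

    Hypothetical helper: the real backend/config.py may look different.
    """
    env = os.environ if env is None else env
    raw = env.get("LLAMAPARSE_API_KEY", "")
    # Remove stray whitespace and quotes that can leak in from a .env file.
    key = raw.strip().strip('"').strip("'")
    if not key:
        raise RuntimeError("LLAMAPARSE_API_KEY is not set")
    return key
```

Failing at startup when the key is absent tends to be easier to diagnose than an "Invalid API key" error on the first parse request.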

3. Install LlamaParse SDK

The LlamaParse SDK is already listed in the project dependencies. To install it on its own:

pip install llama-cloud-services

Implementation

Parser Configuration

Location: backend/parsers/parse.py

from llama_cloud_services import LlamaParse
from config import LLAMAPARSE_API_KEY

# Initialize parser with configuration
parser = LlamaParse(
    api_key=LLAMAPARSE_API_KEY,
    num_workers=4,          # Parallel processing workers
    verbose=False,          # Disable verbose logging
    language="en",          # English language
    result_type="markdown"  # Output format
)

Configuration Options

api_key (string, required): Your LlamaParse API key from LlamaCloud
num_workers (integer, default: 4): Number of parallel workers used to send parsing requests
verbose (boolean, default: false): Enable detailed logging for debugging
language (string, default: "en"): Language of the documents (ISO 639-1 code)
result_type (string, default: "markdown"): Output format: markdown, text, or json

Async PDF Processing

async def process_pdf_async(file_path: str) -> dict:
    """
    Async function to process a single PDF file and extract page-wise content.

    Args:
        file_path: Path to the PDF file (local or URL)

    Returns:
        Dictionary containing extracted text and status
    """
    try:
        # Parse the PDF asynchronously
        results = await parser.aparse(file_path)
        
        # Extract text from all pages
        text = ""
        for page in results.pages:
            text += page.md + "\n"  # Markdown content
            text += "=" * 80 + "\n"  # Page separator
        
        return {
            "text": text,
            "status": "success"
        }

    except Exception as e:
        return {
            "text": "",
            "status": "error",
            "error": str(e)
        }
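Because process_pdf_async joins pages with a line of 80 "=" characters, the per-page text can be recovered later by splitting on that line. The sketch below illustrates the idea. The name split_pages is hypothetical, and it assumes no page contains a line consisting of exactly 80 "=" characters.

```python
PAGE_SEPARATOR = "=" * 80  # same separator line that process_pdf_async appends


def split_pages(text: str) -> list[str]:
    """Split concatenated parser output back into per-page markdown strings."""
    pages: list[str] = []
    current: list[str] = []
    for line in text.splitlines():
        if line == PAGE_SEPARATOR:
            # A separator closes the page collected so far.
            pages.append("\n".join(current).strip())
            current = []
        else:
            current.append(line)
    # Keep trailing content that was not followed by a separator.
    tail = "\n".join(current).strip()
    if tail:
        pages.append(tail)
    return pages
```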

Usage in MedMitra

Document Upload Flow

User uploads PDF → Saved to Supabase Storage
Get file URL → Public URL from Supabase
Parse PDF → LlamaParse extracts text
Store results → Text saved to database
AI Analysis → Groq processes extracted text

Example: Processing Lab Report

from parsers.parse import process_pdf_async
from supabase_client import SupabaseCaseClient

supabase = SupabaseCaseClient()

async def process_lab_report(case_id: str, file_id: str, file_url: str):
    """
    Process a lab report PDF and store extracted text.
    """
    # Parse the PDF
    result = await process_pdf_async(file_url)
    
    if result["status"] == "success":
        # Store extracted text in file metadata
        await supabase.update_case_file_metadata(
            file_id=file_id,
            metadata={
                "extracted_text": result["text"],
                "parse_status": "completed"
            }
        )
        
        print(f"Successfully parsed lab report for case {case_id}")
        return result["text"]
    else:
        print(f"Error parsing PDF: {result['error']}")
        raise RuntimeError(result["error"])

Output Format

Markdown Output

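LlamaParse returns structured markdown that preserves headings, bold field labels, and tables. A parsed CBC report looks like this, ending with the page separator added by process_pdf_async: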

# Lab Report - Complete Blood Count (CBC)

**Patient**: John Doe
**Date**: 2024-03-04
**Lab**: Medical Center Laboratory

## Test Results

| Test | Result | Reference Range | Unit | Flag |
|------|--------|-----------------|------|------|
| WBC | 7.5 | 4.0-11.0 | K/uL | Normal |
| RBC | 4.8 | 4.5-5.5 | M/uL | Normal |
| Hemoglobin | 14.2 | 13.5-17.5 | g/dL | Normal |
| Hematocrit | 42 | 38-50 | % | Normal |
| Platelets | 250 | 150-400 | K/uL | Normal |

## Interpretation

All values within normal reference ranges.

================================================================================

Advantages of Markdown Format

Structured: Headers, tables, and lists preserved
Readable: Easy to display in UI or process with AI
Parseable: Can be further processed for specific data
Convertible: Easy to convert to HTML or other formats
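As an illustration of the "Parseable" point, the sketch below turns the first pipe table in the markdown output into a list of row dictionaries. The function name parse_markdown_table is hypothetical, and the approach assumes simple tables like the one above, with no escaped "|" characters inside cells.

```python
def parse_markdown_table(markdown: str) -> list[dict[str, str]]:
    """Return the rows of the first pipe table in `markdown` as dicts keyed by header."""
    rows: list[dict[str, str]] = []
    header = None
    for raw_line in markdown.splitlines():
        line = raw_line.strip()
        is_table_row = len(line) >= 2 and line.startswith("|") and line.endswith("|")
        if not is_table_row:
            if header is not None:
                break  # the first table has ended
            continue
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        if header is None:
            header = cells
        elif all(cell and set(cell) <= set("-:") for cell in cells):
            continue  # divider row such as |------|------|
        elif len(cells) == len(header):
            rows.append(dict(zip(header, cells)))
    return rows
```

Each row then exposes values by column name, for example row["Result"] and row["Reference Range"] for the CBC table shown earlier.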

Integration with AI Workflow

Extracted text is used in the Medical Insights Agent:

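# 1. Parse lab report (process_pdf_async returns a dict with "text" and "status")
lab_result = await process_pdf_async(lab_report_url)
lab_text = lab_result["text"] if lab_result["status"] == "success" else ""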

# 2. Combine with other data
context = {
    "doctor_notes": case.doctor_notes,
    "lab_data": lab_text,
    "radiology_data": radiology_summaries
}

# 3. Generate AI insights using Groq
insights = await medical_agent.generate_insights(context)

Best Practices

File Handling

Upload PDFs to Supabase Storage first
Use public URLs for LlamaParse access
Store extracted text in database for caching
Keep original PDFs for reference

Error Handling

Always check parse status before using results
Implement retry logic for failed parses
Log parsing errors for debugging
Have fallback for unparseable documents
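The retry advice above could be sketched as follows. Since process_pdf_async reports failures through its "status" field rather than raising, the wrapper checks that field. The name parse_with_retry, the attempt count, and the backoff delays are assumptions chosen for illustration.

```python
import asyncio
from typing import Awaitable, Callable


async def parse_with_retry(
    parse: Callable[[str], Awaitable[dict]],
    file_url: str,
    attempts: int = 3,
    base_delay: float = 1.0,
) -> dict:
    """Call `parse` until it reports success or `attempts` is exhausted."""
    result: dict = {"text": "", "status": "error", "error": "not attempted"}
    for attempt in range(attempts):
        result = await parse(file_url)
        if result.get("status") == "success":
            return result
        if attempt < attempts - 1:
            # Exponential backoff: base_delay, then 2x, then 4x, ...
            await asyncio.sleep(base_delay * (2 ** attempt))
    # Return the last failure so the caller can log result["error"].
    return result
```

A call such as await parse_with_retry(process_pdf_async, file_url) would then replace the direct call in process_lab_report.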

Performance

Use async processing for multiple files
Set appropriate num_workers for your use case
Cache extracted text to avoid re-parsing
Monitor LlamaParse usage and credits
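One way to avoid re-parsing is to consult a cache before calling the parser. The sketch below uses any dict-like store keyed by file URL. In MedMitra the store would presumably be the database, so the in-memory mapping and the name parse_cached are stand-ins for illustration.

```python
from typing import Awaitable, Callable, MutableMapping


async def parse_cached(
    parse: Callable[[str], Awaitable[dict]],
    file_url: str,
    cache: MutableMapping[str, str],
) -> str:
    """Return cached text for `file_url`, parsing only on a cache miss."""
    if file_url in cache:
        return cache[file_url]
    result = await parse(file_url)
    if result.get("status") != "success":
        raise RuntimeError(result.get("error", "parse failed"))
    # Only successful parses are cached, so failures can be retried later.
    cache[file_url] = result["text"]
    return result["text"]
```

Keying by URL assumes a file is never replaced in place at the same URL. If that can happen, a content hash would be a safer key.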

Data Quality

Validate extracted text structure
Check for missing tables or sections
Handle multi-page reports properly
Preserve page boundaries for context
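A lightweight check along these lines can flag suspicious output before it reaches the AI workflow. The sketch is illustrative only: find_quality_issues is a hypothetical name, and the specific checks (non-empty text, at least one table row, at least one heading) are assumptions about what a well-parsed lab report contains.

```python
def find_quality_issues(text: str) -> list[str]:
    """Return a list of human-readable problems found in extracted markdown."""
    lines = [line.strip() for line in text.splitlines()]
    # Ignore blank lines and page-separator lines made only of "=".
    content = [line for line in lines if line and set(line) != {"="}]
    if not content:
        return ["empty: no text was extracted"]
    issues: list[str] = []
    if not any(line.startswith("|") and line.endswith("|") for line in content):
        issues.append("no_table: no pipe table rows found")
    if not any(line.startswith("#") for line in content):
        issues.append("no_heading: no markdown headings found")
    return issues
```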

Supported PDF Types

LlamaParse works well with:

Lab Reports: CBC, Chemistry panels, Lipid panels
Pathology Reports: Biopsy results, Cytology
Radiology Reports: Text-based findings and impressions
Medical Records: Patient history, Discharge summaries
Test Results: Any structured medical test data

LlamaParse is optimized for text-based PDFs. Scanned images may require OCR preprocessing.

Cost & Usage Limits

Free Tier

1,000 pages/month free
No credit card required
Basic support

Paid Tiers

Pay-as-you-go: $0.003 per page
Volume discounts available
Priority support

Monitor your usage in the LlamaCloud dashboard. Set up alerts to avoid unexpected costs.

Troubleshooting

API Key Issues

Error: Invalid API key

Verify LLAMAPARSE_API_KEY in .env file
Check key hasn’t expired
Ensure no extra spaces in key
Regenerate key from dashboard

Parsing Failures

Error: Failed to parse PDF

Check PDF is not corrupted
Verify file URL is accessible
Ensure PDF is not password-protected
Check file size is within limits
Try with verbose=True for debugging
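Several of these checks can be run locally on the file bytes before spending parse credits. The sketch below is a heuristic, not a full PDF validator: the name preflight_pdf is hypothetical, the 50 MB limit is an arbitrary placeholder rather than a documented LlamaParse limit, and the /Encrypt test only signals that the file may be encrypted.

```python
MAX_PDF_BYTES = 50 * 1024 * 1024  # placeholder limit, assumed for illustration


def preflight_pdf(data: bytes, max_bytes: int = MAX_PDF_BYTES) -> list[str]:
    """Return reasons a PDF is likely to fail parsing (empty list if none found)."""
    problems: list[str] = []
    if len(data) > max_bytes:
        problems.append("too_large")
    # A PDF starts with a "%PDF-" header, normally at offset 0.
    if b"%PDF-" not in data[:1024]:
        problems.append("not_a_pdf")
    # Encrypted PDFs reference an /Encrypt dictionary; this is only a hint.
    if b"/Encrypt" in data:
        problems.append("possibly_encrypted")
    return problems
```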

Empty Results

Error: Extracted text is empty

PDF may be image-based (needs OCR)
Check if PDF has extractable text
Verify PDF is not encrypted
Try different result_type

Rate Limiting

Error: Rate limit exceeded

Reduce num_workers
Add delays between requests
Upgrade to paid tier
Implement request queuing
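Request queuing and delays can be approximated with an asyncio.Semaphore that caps how many parse calls are in flight. This is a sketch under assumptions: parse_throttled is a hypothetical name, and the concurrency and delay values are illustrative, not tuned to LlamaParse's actual rate limits.

```python
import asyncio
from typing import Awaitable, Callable


async def parse_throttled(
    parse: Callable[[str], Awaitable[dict]],
    file_urls: list[str],
    max_concurrent: int = 2,
    delay_seconds: float = 0.5,
) -> list[dict]:
    """Parse all URLs with at most `max_concurrent` requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def worker(url: str) -> dict:
        async with semaphore:
            result = await parse(url)
            # Hold the slot briefly so consecutive requests are spaced out.
            await asyncio.sleep(delay_seconds)
            return result

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(worker(url) for url in file_urls))
```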

Advanced Configuration

Custom Parser Settings

parser = LlamaParse(
    api_key=LLAMAPARSE_API_KEY,
    num_workers=8,                    # More workers for faster processing
    verbose=True,                     # Enable debugging
    language="en",
    result_type="markdown",
    
    # Advanced options
    parsing_instruction="Focus on extracting lab values and reference ranges",
    skip_diagonal_text=True,          # Skip watermarks
    invalidate_cache=False,           # Use cached results if available
    do_not_cache=False,               # Cache results for future use
    continuous_mode=True,             # Better for multi-page documents
)

Batch Processing

import asyncio

async def process_multiple_pdfs(file_urls: list[str]) -> list[dict]:
    """
    Process multiple PDFs in parallel.

    Results are returned in the same order as file_urls.
    """
    tasks = [process_pdf_async(url) for url in file_urls]
    results = await asyncio.gather(*tasks)
    return results
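Because process_pdf_async returns an error dict instead of raising, a batch can contain a mix of successes and failures. A small helper like the one below separates them. The name partition_results is hypothetical, and the sketch assumes results are in the same order as the input URLs, as asyncio.gather guarantees.

```python
def partition_results(
    file_urls: list[str], results: list[dict]
) -> tuple[dict[str, str], dict[str, str]]:
    """Split batch results into {url: text} successes and {url: error} failures."""
    succeeded: dict[str, str] = {}
    failed: dict[str, str] = {}
    for url, result in zip(file_urls, results):
        if result.get("status") == "success":
            succeeded[url] = result["text"]
        else:
            failed[url] = result.get("error", "unknown error")
    return succeeded, failed
```

The failed mapping can then feed logging or a retry pass without discarding the files that parsed cleanly.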

Alternative Parsing Options

If LlamaParse doesn’t meet your needs:

PyMuPDF: Free, but less intelligent structure extraction
PyPDF2: Simple, but limited table support
Tabula: Good for tables, but requires Java
AWS Textract: Enterprise option with OCR

Next Steps

Gladia Integration: Set up speech-to-text for medical dictation
Medical AI Agents: Learn how parsed data is used by AI agents
