Learn the complete workflow for processing PDF documents, from text extraction to AI tag generation, with support for OCR, multiple languages, and various file sources.

Overview

The Meta-Data Tag Generator provides two processing modes:

Single Document

Process individual PDFs via file upload or URL

Batch Processing

Process multiple documents using CSV with real-time progress tracking

Single Document Processing

Process individual PDF documents with AI-powered tag generation.

Processing Flow

1. Upload or Provide URL

Submit a PDF file or provide a publicly accessible URL to the document.

2. Text Extraction

The system extracts text using a 3-tier approach:
  • PyPDF2 for text-based PDFs (fastest)
  • Tesseract OCR for scanned documents in Hindi/English
  • EasyOCR for complex Indian scripts (automatic fallback)

3. AI Tag Generation

Extracted text is sent to OpenRouter AI to generate searchable metadata tags.

4. Results

Receive generated tags, extracted text preview, and processing metadata.

API Endpoint

POST /api/single/process

Authentication

All processing endpoints require authentication. Include your access token:
Authorization: Bearer <your-access-token>

Processing via File Upload

import requests
import json

url = "http://localhost:8000/api/single/process"

# Configuration for AI tagging
config = {
    "api_key": "your-openrouter-api-key",
    "model_name": "openai/gpt-4o-mini",
    "num_pages": 3,
    "num_tags": 8,
    "exclusion_words": ["government", "india"]  # Optional
}

headers = {
    "Authorization": f"Bearer {access_token}"  # token from the authentication flow
}

# Prepare multipart form data; the context manager closes the file handle
with open("document.pdf", "rb") as pdf_file:
    files = {"pdf_file": pdf_file}
    data = {"config": json.dumps(config)}
    response = requests.post(url, files=files, data=data, headers=headers)

response.raise_for_status()
result = response.json()

print(f"Document: {result['document_title']}")
print(f"Tags: {result['tags']}")
print(f"Extraction method: {result['extraction_method']}")
print(f"Processing time: {result['processing_time']}s")

Processing via URL

Process documents from public URLs without downloading them first:
URL Processing
import requests
import json

url = "http://localhost:8000/api/single/process"

config = {
    "api_key": "your-openrouter-api-key",
    "model_name": "openai/gpt-4o-mini",
    "num_pages": 3,
    "num_tags": 8
}

data = {
    "pdf_url": "https://example.com/document.pdf",
    "config": json.dumps(config)
}

headers = {"Authorization": f"Bearer {access_token}"}
response = requests.post(url, data=data, headers=headers)
result = response.json()
The system supports various URL types including CloudFront, S3 public URLs, and government portals. URLs must be publicly accessible (no authentication required).

Configuration Parameters

api_key (string, required)
Your OpenRouter API key for AI tag generation.

model_name (string, default: "openai/gpt-4o-mini")
AI model to use. Recommended options:
  • openai/gpt-4o-mini (fast, cost-effective)
  • google/gemini-flash-1.5 (fast, good for multilingual)
  • anthropic/claude-3-haiku (high quality)

num_pages (integer, default: 3)
Number of PDF pages to extract (1-10). More pages = better context but higher API costs.

num_tags (integer, default: 8)
Number of tags to generate (3-15). Tags are optimized for ElasticSearch.

exclusion_words (array, optional)
Words/phrases to exclude from generated tags. Useful for filtering common organizational terms.
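Since the ranges above are enforced per request, a quick client-side check can fail fast before spending an API call. A minimal sketch (the helper name and the clamping behavior are assumptions; the server performs its own validation):

```python
def validate_config(config: dict) -> dict:
    """Apply documented defaults and clamp num_pages/num_tags to their ranges.

    Hypothetical client-side helper, not part of the API.
    """
    if not config.get("api_key"):
        raise ValueError("api_key is required")
    cfg = dict(config)
    cfg.setdefault("model_name", "openai/gpt-4o-mini")
    cfg["num_pages"] = min(max(int(cfg.get("num_pages", 3)), 1), 10)  # 1-10
    cfg["num_tags"] = min(max(int(cfg.get("num_tags", 8)), 3), 15)    # 3-15
    return cfg
```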

Response Format

Response
{
  "success": true,
  "document_title": "Training Manual 2024",
  "tags": [
    "training-manual-2024",
    "employee-development",
    "organizational-policies",
    "standard-operating-procedures",
    "compliance-guidelines",
    "department-protocols",
    "workforce-training",
    "quality-assurance"
  ],
  "extracted_text_preview": "TRAINING MANUAL 2024\n\nTable of Contents\n1. Introduction to Employee Development...",
  "processing_time": 4.23,
  "is_scanned": false,
  "extraction_method": "pypdf2",
  "ocr_confidence": null,
  "raw_ai_response": "training-manual-2024, employee-development, ..."
}
success (boolean)
Indicates whether processing completed successfully

document_title (string)
Document title extracted from PDF metadata or the filename

tags (array)
Generated metadata tags optimized for search

extracted_text_preview (string)
First 500 characters of extracted text

processing_time (float)
Total processing time in seconds

is_scanned (boolean)
Whether the document required OCR processing

extraction_method (string)
Method used: pypdf2, tesseract_ocr, or easyocr

ocr_confidence (float)
OCR confidence score (0-100) for scanned documents; null when OCR was not used
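The fields above are enough to build a quick human-readable summary of each result. A sketch (the helper is illustrative, not part of any client library):

```python
def summarize_result(result: dict) -> str:
    """One-line summary built from the documented response fields."""
    line = (f"{result['document_title']}: {len(result['tags'])} tags "
            f"via {result['extraction_method']} in {result['processing_time']}s")
    # ocr_confidence is null for text-based PDFs, so only report it when set
    if result.get("is_scanned") and result.get("ocr_confidence") is not None:
        line += f" (OCR confidence {result['ocr_confidence']}%)"
    return line
```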

OCR and Multi-Language Support

The system automatically detects scanned documents and applies appropriate OCR:
PyPDF2
  • When used: Documents with selectable text
  • Languages: All languages supported by the PDF text layer
  • Speed: Fastest (< 1 second)
  • Accuracy: 100% (reads embedded text directly)
{
  "extraction_method": "pypdf2",
  "is_scanned": false
}
Tesseract OCR
  • When used: Scanned documents, automatic fallback from PyPDF2
  • Languages: Hindi (hin), English (eng)
  • Speed: Fast (3-5 seconds per page)
  • Accuracy: Good for clear scans (70-95%)
  • Automatic fallback: Switches to EasyOCR if confidence < 60%
{
  "extraction_method": "tesseract_ocr",
  "is_scanned": true,
  "ocr_confidence": 85.3
}
EasyOCR
  • When used: Tesseract confidence < 60%, complex Indian language scripts detected, or low-quality scans
  • Languages: 80+ languages including Hindi, Tamil, Telugu, Bengali, Kannada, Malayalam, Marathi, Gujarati, Punjabi, and more
  • Speed: Slower (10-30 seconds per page, GPU-accelerated)
  • Accuracy: Excellent for complex scripts and poor-quality scans (85-98%)
{
  "extraction_method": "easyocr",
  "is_scanned": true,
  "ocr_confidence": 92.7
}
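The fallback order described above can be summarized as a decision function. This is only an illustration of the documented behavior; the actual tier selection happens server-side:

```python
def choose_extraction_tier(has_text_layer, tesseract_confidence=None,
                           complex_script=False):
    """Mirror the documented 3-tier fallback: PyPDF2 -> Tesseract -> EasyOCR."""
    if has_text_layer:
        return "pypdf2"       # selectable text: no OCR needed
    if complex_script:
        return "easyocr"      # complex Indian scripts go straight to EasyOCR
    if tesseract_confidence is not None and tesseract_confidence < 60:
        return "easyocr"      # low-confidence Tesseract result falls back
    return "tesseract_ocr"
```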

Exclusion List Feature

Filter out common organizational terms that don’t add search value:

Via exclusion_words Parameter

Inline Exclusion
config = {
    "api_key": "sk-or-v1-...",
    "model_name": "openai/gpt-4o-mini",
    "num_pages": 3,
    "num_tags": 8,
    "exclusion_words": [
        "government-india",
        "ministry-of-social-justice",
        "annual-report",
        "newsletter"
    ]
}

Via Exclusion File Upload

Upload a .txt or .pdf file containing exclusion terms:
File-based Exclusion
files = {
    "pdf_file": open("document.pdf", "rb"),
    "exclusion_file": open("exclusion-list.txt", "rb")
}
data = {
    "config": json.dumps(config)
}

response = requests.post(url, files=files, data=data, headers=headers)
Exclusion file format (exclusion-list.txt):
# Common government organizations
government-india
ministry-of-social-justice
social-justice

# Generic document types
annual-report
newsletter
policy-document

# Overly generic terms
empowerment
constitutional-provisions
The system uses a two-layer approach: AI is instructed to avoid excluded terms, and any that slip through are filtered in post-processing. The system ensures you always get the requested number of tags.
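The post-processing layer can be approximated client-side, both to parse an exclusion file and to check which tags the server would drop. A sketch (case-insensitive matching is an assumption; the exact server-side rules may differ):

```python
def load_exclusions(text: str) -> set:
    """Parse an exclusion list: one term per line, '#' comments and blanks ignored."""
    terms = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            terms.add(line.lower())
    return terms

def filter_tags(tags: list, exclusions: set) -> list:
    """Drop any tag found in the exclusion set (the second filtering layer)."""
    return [t for t in tags if t.lower() not in exclusions]
```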

Batch Processing

Process multiple documents efficiently with real-time progress tracking.

Batch Processing Flow

1. Prepare CSV File

Create a CSV with document metadata and file paths (URLs, S3 paths, or local files).

2. Start Batch Job

Submit CSV and configuration to start background processing.

3. Monitor Progress

Connect via WebSocket to receive real-time updates for each document.

4. Retrieve Results

Access processed results with tags for all documents.

Start Batch Job

POST /api/batch/start
import requests
import json

url = "http://localhost:8000/api/batch/start"

documents = [
    {
        "title": "Training Manual",
        "description": "Employee training document",
        "file_source_type": "url",
        "file_path": "https://example.com/doc1.pdf",
        "publishing_date": "2025-01-15",
        "file_size": "1.2MB"
    },
    {
        "title": "Annual Report 2024",
        "file_source_type": "url",
        "file_path": "https://example.com/doc2.pdf"
    }
]

config = {
    "api_key": "your-openrouter-api-key",
    "model_name": "openai/gpt-4o-mini",
    "num_pages": 3,
    "num_tags": 8
}

payload = {
    "documents": documents,
    "config": config
}

headers = {"Authorization": f"Bearer {access_token}"}
response = requests.post(url, json=payload, headers=headers)
job = response.json()

print(f"Job started: {job['job_id']}")
print(f"Total documents: {job['total_documents']}")
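Step 1 of the flow prepares a CSV, while the endpoint accepts a JSON `documents` list, so a small converter bridges the two. A sketch assuming the CSV header row uses the payload keys shown above (title, file_source_type, file_path, plus optional fields):

```python
import csv

def csv_to_documents(csv_path: str) -> list:
    """Read a batch CSV into the documents list expected by /api/batch/start.

    Assumes column names match the payload keys; empty optional cells are
    dropped so server-side defaults apply.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        return [{k: v for k, v in row.items() if v} for row in csv.DictReader(f)]
```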

Real-time Progress via WebSocket

Connect to WebSocket endpoint to receive live updates:
WebSocket Progress Monitoring
import asyncio
import websockets
import json

async def monitor_batch_progress(job_id, access_token):
    uri = f"ws://localhost:8000/api/batch/ws/{job_id}?token={access_token}"
    
    async with websockets.connect(uri) as websocket:
        while True:
            message = await websocket.recv()
            data = json.loads(message)
            
            if data['type'] == 'catchup':
                print(f"Catching up: {len(data['results'])} results so far")
            
            elif data['type'] == 'progress':
                print(f"[{data['row_number']}/{data.get('total', '?')}] {data['title']}: {data['status']}")
                if data['status'] == 'success':
                    print(f"  Tags: {data['tags']}")
                elif data['status'] == 'failed':
                    print(f"  Error: {data['error']}")
            
            elif data['type'] == 'completed':
                print(f"\nBatch completed!")
                print(f"Processed: {data['processed_count']}")
                print(f"Failed: {data['failed_count']}")
                print(f"Time: {data['processing_time']}s")
                break
            
            elif data['type'] == 'error':
                print(f"Error: {data.get('message')}")
                break

# Run monitoring
asyncio.run(monitor_batch_progress(job_id, access_token))

WebSocket Message Types

Sent when you first connect; it contains all results processed so far:
{
  "type": "catchup",
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "state": {
    "status": "processing",
    "progress": 0.3,
    "processed_count": 3,
    "total": 10
  },
  "results": [
    {"row_id": 0, "status": "success", "tags": [...]},
    {"row_id": 1, "status": "success", "tags": [...]},
    {"row_id": 2, "status": "failed", "error": "..."}
  ]
}
Sent for each document as it’s processed:
{
  "type": "progress",
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "row_id": 3,
  "row_number": 4,
  "title": "Annual Report 2024",
  "status": "success",
  "progress": 0.4,
  "tags": ["annual-report-2024", "financial-summary", ...],
  "metadata": {
    "extraction_method": "pypdf2",
    "is_scanned": false,
    "processing_time": 3.2
  }
}
Sent when all documents are processed:
{
  "type": "completed",
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "total_documents": 10,
  "processed_count": 9,
  "failed_count": 1,
  "processing_time": 45.2,
  "message": "Completed: 9 succeeded, 1 failed"
}
Sent if the job encounters a fatal error:
{
  "type": "error",
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "message": "Job cancelled by user"
}

Job Control

Manage running batch jobs:
url = f"http://localhost:8000/api/batch/jobs/{job_id}/status"
headers = {"Authorization": f"Bearer {access_token}"}

response = requests.get(url, headers=headers)
status = response.json()

print(f"Status: {status['status']}")
print(f"Progress: {status['progress'] * 100}%")
print(f"Processed: {status['processed_count']}/{status['total']}")
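If WebSockets are unavailable (for example behind a restrictive proxy), the status endpoint above can be polled instead. A sketch with the fetch call injected so the loop is testable without a server; the terminal status values ("completed", "failed", "cancelled") are assumptions based on the messages shown earlier:

```python
import time

TERMINAL_STATES = {"completed", "failed", "cancelled"}  # assumed status values

def poll_until_done(fetch_status, poll_interval=5.0):
    """Call fetch_status() until the job reaches a terminal state.

    fetch_status would wrap the GET /api/batch/jobs/{job_id}/status request
    shown above and return its parsed JSON.
    """
    while True:
        status = fetch_status()
        if status["status"] in TERMINAL_STATES:
            return status
        time.sleep(poll_interval)
```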

Path Validation

Validate file paths before processing to catch errors early:
Pre-flight Validation
url = "http://localhost:8000/api/batch/validate-paths"

payload = {
    "paths": [
        {"path": "https://example.com/doc1.pdf", "type": "url"},
        {"path": "s3://my-bucket/doc2.pdf", "type": "s3"},
        {"path": "/local/path/doc3.pdf", "type": "local"}
    ]
}

headers = {"Authorization": f"Bearer {access_token}"}
response = requests.post(url, json=payload, headers=headers)
results = response.json()

for result in results['results']:
    if result['valid']:
        print(f"✓ {result['path']} - {result['content_type']}")
    else:
        print(f"✗ {result['path']} - {result['error']}")

print(f"\nValid: {results['valid_count']}, Invalid: {results['invalid_count']}")

Best Practices

Optimize Page Count

Start with 3 pages for most documents. Increase for complex docs, decrease for simple ones to save API costs.

Use Exclusion Lists

Create organization-specific exclusion lists to filter common terms that don’t add search value.

Choose Right Model

  • gpt-4o-mini: Best balance of speed/cost
  • gemini-flash-1.5: Excellent for multilingual
  • claude-3-haiku: Highest quality

Handle OCR Confidence

For scanned docs, check ocr_confidence. If it is below 70%, consider manual review or a higher-quality scan.
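The confidence check can be automated when triaging results. A sketch (the 70% threshold mirrors the guidance above; the helper itself is hypothetical):

```python
def needs_review(result: dict, threshold: float = 70.0) -> bool:
    """Flag scanned documents whose OCR confidence falls below the threshold."""
    if not result.get("is_scanned"):
        return False  # text-based PDFs read at full fidelity, no review needed
    confidence = result.get("ocr_confidence")
    return confidence is not None and confidence < threshold
```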
