Learn the complete workflow for processing PDF documents, from text extraction to AI tag generation, with support for OCR, multiple languages, and various file sources.
Overview
The Meta-Data Tag Generator provides two processing modes:
Single Document: Process individual PDFs via file upload or URL
Batch Processing: Process multiple documents using a CSV with real-time progress tracking
Single Document Processing
Process individual PDF documents with AI-powered tag generation.
Processing Flow
Upload or Provide URL
Submit a PDF file or provide a publicly accessible URL to the document.
Text Extraction
The system extracts text using a 3-tier approach:
PyPDF2 for text-based PDFs (fastest)
Tesseract OCR for scanned documents in Hindi/English
EasyOCR for complex Indian scripts (automatic fallback)
AI Tag Generation
Extracted text is sent to OpenRouter AI to generate searchable metadata tags.
Results
Receive generated tags, extracted text preview, and processing metadata.
API Endpoint
Authentication
All processing endpoints require authentication. Include your access token:
Authorization: Bearer <your-access-token>
Processing via File Upload
import requests
import json

url = "http://localhost:8000/api/single/process"

# Configuration for AI tagging
config = {
    "api_key": "your-openrouter-api-key",
    "model_name": "openai/gpt-4o-mini",
    "num_pages": 3,
    "num_tags": 8,
    "exclusion_words": ["government", "india"]  # Optional
}

# Prepare multipart form data
files = {
    "pdf_file": open("document.pdf", "rb")
}
data = {
    "config": json.dumps(config)
}
headers = {
    "Authorization": f"Bearer {access_token}"
}

response = requests.post(url, files=files, data=data, headers=headers)
result = response.json()

print(f"Document: {result['document_title']}")
print(f"Tags: {result['tags']}")
print(f"Extraction method: {result['extraction_method']}")
print(f"Processing time: {result['processing_time']}s")
Processing via URL
Process documents from public URLs without downloading them first:
import requests
import json

url = "http://localhost:8000/api/single/process"

config = {
    "api_key": "your-openrouter-api-key",
    "model_name": "openai/gpt-4o-mini",
    "num_pages": 3,
    "num_tags": 8
}

data = {
    "pdf_url": "https://example.com/document.pdf",
    "config": json.dumps(config)
}
headers = {"Authorization": f"Bearer {access_token}"}

response = requests.post(url, data=data, headers=headers)
result = response.json()
The system supports various URL types including CloudFront, S3 public URLs, and government portals. URLs must be publicly accessible (no authentication required).
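Before submitting a URL job, a quick client-side sanity check can catch obviously malformed links. The helper below is a hypothetical sketch that only inspects the URL string; it does not replace server-side path validation.

```python
from urllib.parse import urlparse

def precheck_pdf_url(url: str) -> list:
    """Return a list of problems with a candidate PDF URL (empty list = looks OK).

    Local sanity check only: the server still decides whether the URL is
    actually reachable and publicly accessible.
    """
    problems = []
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        problems.append("URL must use http or https")
    if not parsed.netloc:
        problems.append("URL has no host")
    elif "@" in parsed.netloc:
        problems.append("embedded credentials suggest the URL is not public")
    return problems
```

An empty return value means the URL is worth submitting; anything else can be surfaced to the user before the request is made.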
Configuration Parameters
api_key
string
required
Your OpenRouter API key for AI tag generation
model_name
string
default: "openai/gpt-4o-mini"
AI model to use. Recommended options:
openai/gpt-4o-mini (fast, cost-effective)
google/gemini-flash-1.5 (fast, good for multilingual)
anthropic/claude-3-haiku (high quality)
num_pages
integer
Number of PDF pages to extract (1-10). More pages give better context but higher API costs.
num_tags
integer
Number of tags to generate (3-15). Tags are optimized for ElasticSearch.
exclusion_words
string[]
Words/phrases to exclude from generated tags. Useful for filtering common organizational terms.
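A client-side check against the documented ranges can fail fast before any API credits are spent. This sketch is illustrative, not part of the API; the fallback values (3 pages, 8 tags) match the examples in this guide but are assumptions, not documented defaults.

```python
def validate_config(config: dict) -> list:
    """Return a list of validation errors for a tagging config (empty = valid)."""
    errors = []
    if not config.get("api_key"):
        errors.append("api_key is required")
    # Ranges below come from the parameter documentation (1-10 pages, 3-15 tags)
    if not 1 <= config.get("num_pages", 3) <= 10:
        errors.append("num_pages must be between 1 and 10")
    if not 3 <= config.get("num_tags", 8) <= 15:
        errors.append("num_tags must be between 3 and 15")
    return errors
```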
{
  "success": true,
  "document_title": "Training Manual 2024",
  "tags": [
    "training-manual-2024",
    "employee-development",
    "organizational-policies",
    "standard-operating-procedures",
    "compliance-guidelines",
    "department-protocols",
    "workforce-training",
    "quality-assurance"
  ],
  "extracted_text_preview": "TRAINING MANUAL 2024\n\nTable of Contents\n1. Introduction to Employee Development...",
  "processing_time": 4.23,
  "is_scanned": false,
  "extraction_method": "pypdf2",
  "ocr_confidence": null,
  "raw_ai_response": "training-manual-2024, employee-development, ..."
}
success: Indicates if processing completed successfully
document_title: Extracted document title from PDF metadata or filename
tags: Generated metadata tags optimized for search
extracted_text_preview: First 500 characters of extracted text
processing_time: Total processing time in seconds
is_scanned: Whether the document required OCR processing
extraction_method: Extraction method used (pypdf2, tesseract_ocr, or easyocr)
ocr_confidence: OCR confidence score (0-100) for scanned documents
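These fields condense naturally into a one-line log entry. The helper below is a hypothetical convenience using only the response fields shown in the example above.

```python
def summarize_result(result: dict) -> str:
    """One-line summary of a /api/single/process response."""
    title = result.get("document_title", "?")
    method = result.get("extraction_method", "unknown")
    line = f"{title}: {len(result.get('tags', []))} tags via {method}"
    # OCR confidence is only reported for scanned documents
    if result.get("is_scanned") and result.get("ocr_confidence") is not None:
        line += f" (OCR confidence {result['ocr_confidence']:.1f}%)"
    return line
```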
OCR and Multi-Language Support
The system automatically detects scanned documents and applies appropriate OCR:
Text-Based PDFs (PyPDF2)
When used: Documents with selectable text
Languages: All languages supported by the PDF text layer
Speed: Fastest (< 1 second)
Accuracy: 100% (reads embedded text directly)
{
  "extraction_method": "pypdf2",
  "is_scanned": false
}
Hindi/English Scans (Tesseract OCR)
When used: Scanned documents, automatic fallback from PyPDF2
Languages: Hindi (hin), English (eng)
Speed: Fast (3-5 seconds per page)
Accuracy: Good for clear scans (70-95%)
Automatic fallback: Switches to EasyOCR if confidence < 60%
{
  "extraction_method": "tesseract_ocr",
  "is_scanned": true,
  "ocr_confidence": 85.3
}
Complex Indian Scripts (EasyOCR)
When used:
Tesseract confidence < 60%
Complex Indian language scripts detected
Low-quality scans
Languages: 80+ languages including Hindi, Tamil, Telugu, Bengali, Kannada, Malayalam, Marathi, Gujarati, Punjabi, and more
Speed: Slower (10-30 seconds per page, GPU-accelerated)
Accuracy: Excellent for complex scripts and poor-quality scans (85-98%)
{
  "extraction_method": "easyocr",
  "is_scanned": true,
  "ocr_confidence": 92.7
}
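The fallback chain above can be summarized in a few lines. This is an illustrative sketch of the documented decision logic, not the server implementation; the 60% threshold comes from the Tesseract tier description.

```python
def choose_extraction_method(has_text_layer, tesseract_confidence=None):
    """Pick an extraction tier following the documented fallback chain."""
    if has_text_layer:
        return "pypdf2"          # embedded text: fastest, no OCR needed
    if tesseract_confidence is not None and tesseract_confidence >= 60:
        return "tesseract_ocr"   # clear Hindi/English scan
    return "easyocr"             # complex scripts or low-confidence scans
```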
Exclusion List Feature
Filter out common organizational terms that don’t add search value:
Via exclusion_words Parameter
config = {
    "api_key": "sk-or-v1-...",
    "model_name": "openai/gpt-4o-mini",
    "num_pages": 3,
    "num_tags": 8,
    "exclusion_words": [
        "government-india",
        "ministry-of-social-justice",
        "annual-report",
        "newsletter"
    ]
}
Via Exclusion File Upload
Upload a .txt or .pdf file containing exclusion terms:
files = {
    "pdf_file": open("document.pdf", "rb"),
    "exclusion_file": open("exclusion-list.txt", "rb")
}
data = {
    "config": json.dumps(config)
}

response = requests.post(url, files=files, data=data, headers=headers)
Exclusion file format (exclusion-list.txt):
# Common government organizations
government-india
ministry-of-social-justice
social-justice
# Generic document types
annual-report
newsletter
policy-document
# Overly generic terms
empowerment
constitutional-provisions
The system uses a two-layer approach: AI is instructed to avoid excluded terms, and any that slip through are filtered in post-processing. The system ensures you always get the requested number of tags.
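The post-processing layer can be approximated with a simple case-insensitive filter. This is a hypothetical sketch; the real service also tops the list back up to the requested tag count, which is omitted here.

```python
def filter_tags(tags, exclusion_words, num_tags):
    """Drop excluded terms (case-insensitive) and trim to the requested count."""
    excluded = {word.strip().lower() for word in exclusion_words}
    kept = [tag for tag in tags if tag.lower() not in excluded]
    return kept[:num_tags]
```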
Batch Processing
Process multiple documents efficiently with real-time progress tracking.
Batch Processing Flow
Prepare CSV File
Create a CSV with document metadata and file paths (URLs, S3 paths, or local files).
Start Batch Job
Submit CSV and configuration to start background processing.
Monitor Progress
Connect via WebSocket to receive real-time updates for each document.
Retrieve Results
Access processed results with tags for all documents.
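The CSV-preparation step can be handled with the standard library. This sketch assumes column names matching the document fields used in this guide's batch example (title, file_source_type, file_path, plus optional description, publishing_date, file_size); adapt them to your own CSV layout.

```python
import csv

def load_documents(csv_path):
    """Build the documents list for /api/batch/start from a CSV file."""
    documents = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            doc = {
                "title": row["title"],
                "file_source_type": row["file_source_type"],
                "file_path": row["file_path"],
            }
            # Optional columns are included only when non-empty
            for optional in ("description", "publishing_date", "file_size"):
                if row.get(optional):
                    doc[optional] = row[optional]
            documents.append(doc)
    return documents
```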
Start Batch Job
import requests
import json

url = "http://localhost:8000/api/batch/start"

documents = [
    {
        "title": "Training Manual",
        "description": "Employee training document",
        "file_source_type": "url",
        "file_path": "https://example.com/doc1.pdf",
        "publishing_date": "2025-01-15",
        "file_size": "1.2MB"
    },
    {
        "title": "Annual Report 2024",
        "file_source_type": "url",
        "file_path": "https://example.com/doc2.pdf"
    }
]

config = {
    "api_key": "your-openrouter-api-key",
    "model_name": "openai/gpt-4o-mini",
    "num_pages": 3,
    "num_tags": 8
}

payload = {
    "documents": documents,
    "config": config
}

headers = {"Authorization": f"Bearer {access_token}"}
response = requests.post(url, json=payload, headers=headers)
job = response.json()

print(f"Job started: {job['job_id']}")
print(f"Total documents: {job['total_documents']}")
Real-time Progress via WebSocket
Connect to WebSocket endpoint to receive live updates:
WebSocket Progress Monitoring
import asyncio
import websockets
import json

async def monitor_batch_progress(job_id, access_token):
    uri = f"ws://localhost:8000/api/batch/ws/{job_id}?token={access_token}"
    async with websockets.connect(uri) as websocket:
        while True:
            message = await websocket.recv()
            data = json.loads(message)

            if data['type'] == 'catchup':
                print(f"Catching up: {len(data['results'])} results so far")
            elif data['type'] == 'progress':
                print(f"[{data['row_number']}/{data.get('total', '?')}] {data['title']}: {data['status']}")
                if data['status'] == 'success':
                    print(f"  Tags: {data['tags']}")
                elif data['status'] == 'failed':
                    print(f"  Error: {data['error']}")
            elif data['type'] == 'completed':
                print("\nBatch completed!")
                print(f"Processed: {data['processed_count']}")
                print(f"Failed: {data['failed_count']}")
                print(f"Time: {data['processing_time']}s")
                break
            elif data['type'] == 'error':
                print(f"Error: {data.get('message')}")
                break

# Run monitoring
asyncio.run(monitor_batch_progress(job_id, access_token))
WebSocket Message Types
catchup - Initial State
Sent when you first connect; contains all results processed so far:
{
  "type": "catchup",
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "state": {
    "status": "processing",
    "progress": 0.3,
    "processed_count": 3,
    "total": 10
  },
  "results": [
    {"row_id": 0, "status": "success", "tags": [...]},
    {"row_id": 1, "status": "success", "tags": [...]},
    {"row_id": 2, "status": "failed", "error": "..."}
  ]
}
progress - Document Update
Sent for each document as it’s processed:
{
  "type": "progress",
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "row_id": 3,
  "row_number": 4,
  "title": "Annual Report 2024",
  "status": "success",
  "progress": 0.4,
  "tags": ["annual-report-2024", "financial-summary", ...],
  "metadata": {
    "extraction_method": "pypdf2",
    "is_scanned": false,
    "processing_time": 3.2
  }
}
completed - Job Finished
Sent when all documents are processed:
{
  "type": "completed",
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "total_documents": 10,
  "processed_count": 9,
  "failed_count": 1,
  "processing_time": 45.2,
  "message": "Completed: 9 succeeded, 1 failed"
}
error - Fatal Error
Sent if the job encounters a fatal error:
{
  "type": "error",
  "job_id": "550e8400-e29b-41d4-a716-446655440000",
  "message": "Job cancelled by user"
}
Job Control
Manage running batch jobs:
Get Job Status
Cancel Job
Pause/Resume Job
url = f"http://localhost:8000/api/batch/jobs/{job_id}/status"
headers = {"Authorization": f"Bearer {access_token}"}

response = requests.get(url, headers=headers)
status = response.json()

print(f"Status: {status['status']}")
print(f"Progress: {status['progress'] * 100}%")
print(f"Processed: {status['processed_count']}/{status['total']}")
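If a WebSocket connection is impractical (for example, behind a restrictive proxy), the status endpoint can be polled instead. This sketch takes any callable returning the status dict so it is easy to test; the terminal status names ("completed", "failed", "cancelled") are assumptions based on the job states shown in this guide.

```python
import time

def poll_job(fetch_status, interval=2.0, max_polls=300):
    """Poll until the job reaches a terminal state, then return its status dict.

    fetch_status: any zero-argument callable returning the JSON body of
    GET /api/batch/jobs/{job_id}/status.
    """
    for _ in range(max_polls):
        status = fetch_status()
        # Terminal state names are assumed; adjust to match your deployment.
        if status.get("status") in ("completed", "failed", "cancelled"):
            return status
        time.sleep(interval)
    raise TimeoutError("job did not finish within the polling budget")
```

Usage with requests would look like poll_job(lambda: requests.get(url, headers=headers).json()).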
Path Validation
Validate file paths before processing to catch errors early:
url = "http://localhost:8000/api/batch/validate-paths"

payload = {
    "paths": [
        {"path": "https://example.com/doc1.pdf", "type": "url"},
        {"path": "s3://my-bucket/doc2.pdf", "type": "s3"},
        {"path": "/local/path/doc3.pdf", "type": "local"}
    ]
}

headers = {"Authorization": f"Bearer {access_token}"}
response = requests.post(url, json=payload, headers=headers)
results = response.json()

for result in results['results']:
    if result['valid']:
        print(f"✓ {result['path']} - {result['content_type']}")
    else:
        print(f"✗ {result['path']} - {result['error']}")

print(f"\nValid: {results['valid_count']}, Invalid: {results['invalid_count']}")
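Validation results can then gate the batch submission. A minimal sketch, assuming the results array is returned in the same order as the submitted paths:

```python
def split_by_validation(documents, validation_results):
    """Split documents into (submittable, skipped) using validate-paths output.

    Assumes validation_results[i] corresponds to documents[i].
    """
    submittable, skipped = [], []
    for doc, res in zip(documents, validation_results):
        (submittable if res["valid"] else skipped).append(doc)
    return submittable, skipped
```

Only the submittable list is then sent to /api/batch/start, while the skipped list can be logged with its errors.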
Best Practices
Optimize Page Count: Start with 3 pages for most documents. Increase for complex documents, decrease for simple ones to save API costs.
Use Exclusion Lists: Create organization-specific exclusion lists to filter common terms that don’t add search value.
Choose the Right Model:
gpt-4o-mini: Best balance of speed/cost
gemini-flash-1.5: Excellent for multilingual
claude-3-haiku: Highest quality
Handle OCR Confidence: For scanned documents, check ocr_confidence. If it is below 70%, consider manual review or a higher-quality scan.
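The confidence check in the last tip is easy to automate when triaging results. The 70% threshold mirrors the recommendation above; the function name is illustrative.

```python
def needs_review(result, threshold=70.0):
    """Flag a processing result for manual review when OCR confidence is low."""
    if not result.get("is_scanned"):
        return False  # text-based extraction reads the embedded text directly
    confidence = result.get("ocr_confidence")
    return confidence is None or confidence < threshold
```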