Overview
This guide will walk you through processing your first document with the Meta-Data Tag Generator. You’ll learn how to:
- Start the application
- Configure AI settings
- Process a single document
- Understand the results
- Try batch processing
This guide assumes you’ve already completed the installation. If not, work through the installation guide first.
Prerequisites
Before you begin, make sure you have:
- An OpenRouter API key (free to obtain)
- Docker Compose installed and running
- A PDF document to test with
Step 1: Start the Application
Start all services with Docker Compose. The following services come up:
- Backend API (port 8000)
- Frontend (port 3001)
- PostgreSQL (port 5432)
- MinIO (ports 9000, 9001)
- Redis (port 6379)
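To confirm that the services listed above are reachable, a quick port check can help. The following is a minimal sketch, not part of the tool itself: it assumes Docker publishes the ports on localhost, and the names `SERVICES`, `is_port_open`, and `check_services` are hypothetical helpers introduced for illustration.

```python
import socket

# Ports as listed in this guide. "localhost" as the host is an assumption
# about where Docker publishes them on your machine.
SERVICES = {
    "Backend API": 8000,
    "Frontend": 3001,
    "PostgreSQL": 5432,
    "MinIO (9000)": 9000,
    "MinIO (9001)": 9001,
    "Redis": 6379,
}


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(host: str = "localhost") -> dict[str, bool]:
    """Map each service name to whether its port accepts connections."""
    return {name: is_port_open(host, port) for name, port in SERVICES.items()}


if __name__ == "__main__":
    for name, up in check_services().items():
        print(f"{name:<14} {'up' if up else 'DOWN'}")
```

An open port only shows that something is listening; it does not prove the service behind it is healthy.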
Step 2: Access the Web Interface
Open your browser and navigate to the frontend (port 3001). The interface offers two modes:
- Single Upload: Process one document at a time
- Batch Processing: Process multiple documents from a CSV file
Step 3: Configure AI Settings
Before processing documents, you need to configure the AI tagging settings.
Select AI Model
Choose from available models:
| Model | Speed | Cost | Best For |
|---|---|---|---|
| openai/gpt-4o-mini | Fast | Low | General documents |
| google/gemini-flash-1.5 | Fastest | Lowest | High-volume processing |
| anthropic/claude-3-haiku | Fast | Low | Complex documents |
| openai/gpt-4o | Slower | Higher | Maximum quality |
Recommended for getting started: openai/gpt-4o-mini offers the best balance of speed, cost, and quality.
Step 4: Process Your First Document
Now let’s process a document! You can use either a file upload or a URL:
- Upload a File
- Process from URL
Select a PDF file
Click “Choose File” or drag and drop a PDF into the upload area.
Maximum file size: 50MB
Supported format: PDF only
Click 'Process Document'
The system will:
- Detect if the document is scanned or digital
- Extract text using the optimal method
- Send text to AI for tag generation
- Apply any exclusion filters
- Return structured results
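Before uploading, it can save a round trip to check a file against the limits stated above (PDF only, 50 MB maximum). The sketch below is a hypothetical client-side check, not the application's own validation; `validate_upload` is an illustrative name, and treating 50 MB as 50 × 1024 × 1024 bytes is an assumption.

```python
from pathlib import Path

# 50 MB limit from this guide; interpreting MB as binary megabytes is an assumption.
MAX_UPLOAD_BYTES = 50 * 1024 * 1024


def validate_upload(path: str) -> list[str]:
    """Return a list of problems; an empty list means the file looks uploadable."""
    p = Path(path)
    if not p.is_file():
        return [f"{path}: file not found"]

    problems = []
    if p.suffix.lower() != ".pdf":
        problems.append("only PDF files are supported")
    if p.stat().st_size > MAX_UPLOAD_BYTES:
        problems.append("file exceeds the 50 MB limit")
    # PDF files begin with the "%PDF-" marker.
    with p.open("rb") as fh:
        if fh.read(5) != b"%PDF-":
            problems.append("file does not start with a PDF header")
    return problems
```

For example, `validate_upload("report.pdf")` returns an empty list when the file exists, has a `.pdf` extension, is within the size limit, and starts with a PDF header.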
Step 5: Understanding the Results
After processing completes, you’ll see detailed results.
Result Fields Explained
Generated metadata tags in priority order:
- Names: Specific entities (e.g., “ministry social justice empowerment”)
- Subjects: Topics and domains (e.g., “scheduled castes welfare”)
- Actions: Document purpose (e.g., “annual report 2023 24”)
Method used for text extraction:
- pypdf2: Digital PDF (fastest)
- tesseract_ocr: Scanned PDF with Tesseract
- easyocr: Complex scripts with EasyOCR
Whether the document is scanned (requires OCR) or digital (text-based)
OCR confidence score (0-100) when using Tesseract. Scores below 60 trigger the EasyOCR fallback.
Total processing time in seconds including text extraction and AI generation
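The relationship between the scanned/digital flag, the OCR confidence score, and the reported extraction method can be sketched as a small decision function. This is an illustration of the behavior described in this guide, not the application's actual code: the function name and signature are hypothetical, and the threshold of 60 is taken from the troubleshooting section.

```python
from typing import Optional

# Threshold taken from the troubleshooting notes in this guide:
# confidence values below 60 trigger EasyOCR.
EASYOCR_FALLBACK_THRESHOLD = 60.0


def choose_extraction_method(is_scanned: bool,
                             ocr_confidence: Optional[float] = None) -> str:
    """Return the extraction method name this guide would lead you to expect."""
    if not is_scanned:
        # Digital PDFs carry a text layer, so no OCR is needed.
        return "pypdf2"
    if ocr_confidence is not None and ocr_confidence < EASYOCR_FALLBACK_THRESHOLD:
        # Tesseract result was too uncertain; fall back to EasyOCR.
        return "easyocr"
    return "tesseract_ocr"
```

Reading the result fields with this model in mind makes it easier to spot surprises, such as a scanned document reporting a low confidence score without an EasyOCR fallback.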
Step 6: Using Exclusion Lists (Optional)
Exclusion lists help filter out generic terms that appear frequently across documents.
Create an exclusion list file
Create a text file with terms to exclude (one per line):
exclusion-list.txt
Upload the exclusion file
In the configuration panel, click “Upload Exclusion List” and select your file.
Supported formats: .txt, .pdf
Format: One term per line or comma-separated
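To preview what an exclusion list would do to a set of tags, the following sketch parses a .txt list in either supported layout (one term per line or comma-separated) and filters tags against it. It is an assumption-laden illustration: the function names are hypothetical, and exact, case-insensitive matching is a guess at the behavior rather than the documented rule.

```python
import re


def parse_exclusion_list(text: str) -> set[str]:
    """Split on commas and newlines, trim whitespace, and drop empty entries."""
    terms = re.split(r"[,\n]", text)
    return {t.strip().lower() for t in terms if t.strip()}


def apply_exclusions(tags: list[str], excluded: set[str]) -> list[str]:
    """Keep tags that do not exactly match an excluded term (case-insensitive)."""
    return [t for t in tags if t.strip().lower() not in excluded]


if __name__ == "__main__":
    excluded = parse_exclusion_list("government, india\nreport\n")
    print(apply_exclusions(["report", "scheduled castes welfare"], excluded))
```

Because matching here is on whole tags, a multi-word tag such as "annual report 2023 24" is kept even when "report" is excluded.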
Step 7: Try Batch Processing
Process multiple documents at once using CSV input.
Prepare your CSV
Required columns:
- title: Document title
- file_source_type: url, s3, or local
- file_path: URL or path to the PDF
Optional columns:
- description: Document description
- publishing_date: Publication date
- file_size: File size
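A CSV can be checked locally before upload. The sketch below validates the title, file_source_type, and file_path columns named above using Python's standard csv module; the helper name, the error wording, and the example URL are hypothetical.

```python
import csv
import io

CORE_COLUMNS = ["title", "file_source_type", "file_path"]
ALLOWED_SOURCE_TYPES = {"url", "s3", "local"}


def validate_batch_csv(csv_text: str) -> list[str]:
    """Return a list of problems found in the CSV; empty means it looks usable."""
    reader = csv.DictReader(io.StringIO(csv_text))
    headers = reader.fieldnames or []
    errors = [f"missing column: {c}" for c in CORE_COLUMNS if c not in headers]
    if errors:
        return errors

    for row_no, row in enumerate(reader, start=1):
        for col in CORE_COLUMNS:
            if not (row.get(col) or "").strip():
                errors.append(f"row {row_no}: empty {col}")
        source_type = (row.get("file_source_type") or "").strip()
        if source_type and source_type not in ALLOWED_SOURCE_TYPES:
            errors.append(f"row {row_no}: unknown file_source_type {source_type!r}")
    return errors


if __name__ == "__main__":
    sample = (
        "title,file_source_type,file_path,description\n"
        "Annual Report,url,https://example.com/report.pdf,Sample entry\n"
    )
    print(validate_batch_csv(sample) or "CSV looks valid")
```

Row numbers in the messages count data rows, starting at 1 after the header.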
Upload and process
- Upload your CSV file
- Map columns if needed
- Click “Start Processing”
- Watch real-time progress via WebSocket updates
Batch processing includes intelligent rate limiting to avoid API throttling. Large batches are processed sequentially with exponential backoff.
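Exponential backoff means each retry after a rate-limit error waits longer than the previous one, typically doubling up to a cap. The sketch below illustrates the general pattern only; the base delay, factor, cap, and the `RateLimited` exception are assumptions and do not reflect the application's actual settings.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical error raised when the API reports a rate limit."""


def backoff_delays(retries: int, base: float = 1.0,
                   factor: float = 2.0, cap: float = 60.0) -> list[float]:
    """Delay before each retry: base, base*factor, base*factor^2, ... capped at cap."""
    return [min(cap, base * factor ** attempt) for attempt in range(retries)]


def call_with_backoff(func: Callable[[], T], max_retries: int = 5,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call func, retrying on RateLimited with exponentially growing waits."""
    delays = backoff_delays(max_retries)
    for attempt in range(max_retries + 1):
        try:
            return func()
        except RateLimited:
            if attempt == max_retries:
                raise  # out of retries; surface the error
            sleep(delays[attempt])
    raise RuntimeError("unreachable")
```

With the defaults, `backoff_delays(5)` yields waits of 1, 2, 4, 8, and 16 seconds. Production implementations often add random jitter so that parallel clients do not retry in lockstep.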
Common Issues & Solutions
API Authentication Error
Error: Invalid API key. Please check your OpenRouter API key.
Solution:
- Verify your API key is correct (it starts with sk-or-v1-)
- Check whether your API key has available credits
- Visit OpenRouter Keys to regenerate the key
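A quick format check can catch copy-paste mistakes such as a truncated key or stray whitespace. This sketch tests only the sk-or-v1- prefix mentioned above; the function name is hypothetical, and a key that passes may still be invalid or out of credits.

```python
OPENROUTER_KEY_PREFIX = "sk-or-v1-"


def looks_like_openrouter_key(key: str) -> bool:
    """Check the prefix only; this does not contact OpenRouter."""
    key = key.strip()
    return key.startswith(OPENROUTER_KEY_PREFIX) and len(key) > len(OPENROUTER_KEY_PREFIX)
```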
Rate Limit Errors
Error: RATE_LIMITED: OpenRouter free tier limit hit
Solution:
- The free tier has strict rate limits
- Add credits to your OpenRouter account on the Billing page
- For batch processing, the system automatically adds delays
- Reduce concurrent requests
OCR Produces Gibberish
Issue: Extracted text contains nonsensical characters
Solution:
- The document may be very low quality; try rescanning at a higher DPI
- For complex Indian scripts, the system automatically falls back to EasyOCR
- Check the OCR confidence score; values below 60 trigger the EasyOCR fallback
Document Processing Timeout
Error: Processing takes too long or times out
Solution:
- Reduce num_pages to 1-2 for faster processing
- Large scanned PDFs take longer (10-30 seconds)
- EasyOCR downloads models on first use (one-time delay)
- Check Docker resource limits (increase RAM to 8 GB for OCR)
WebSocket Connection Failed
Issue: Real-time progress not showing in batch processing
Solution:
- Check whether Redis is running: docker-compose ps redis
- Verify that the WebSocket endpoint is accessible
- Check the browser console for connection errors
- Ensure no firewall is blocking WebSocket connections
Performance Tips
For Speed
- Use the google/gemini-flash-1.5 model
- Process only the first 1-2 pages
- Use digital PDFs when possible
- Reduce number of tags to 5
For Quality
- Use openai/gpt-4o or anthropic/claude-3-opus
- Process 3-5 pages
- Use exclusion lists to filter noise
- Request 10-12 tags for more options
For Cost
- Use google/gemini-flash-1.5 (lowest cost)
- Process 1-2 pages only
- Batch process to amortize overhead
- Use smaller num_tags values
For Accuracy
- Ensure documents are high quality
- For scanned docs, use 300+ DPI
- Include document descriptions
- Use language-specific models if available
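The speed, quality, and cost tips can be captured as named presets, which makes it easy to switch between them consistently. This is a sketch: num_pages and num_tags are the setting names used in this guide, while the "model" key, the preset names, and the specific values picked from each suggested range are assumptions.

```python
# Each preset picks one value from the ranges suggested in the tips above.
PRESETS = {
    "speed": {"model": "google/gemini-flash-1.5", "num_pages": 2, "num_tags": 5},
    "quality": {"model": "openai/gpt-4o", "num_pages": 5, "num_tags": 12},
    "cost": {"model": "google/gemini-flash-1.5", "num_pages": 1, "num_tags": 5},
}


def settings_for(goal: str) -> dict:
    """Return a copy of the preset for the given goal, or raise ValueError."""
    try:
        return dict(PRESETS[goal])
    except KeyError:
        raise ValueError(f"unknown goal {goal!r}; choose from {sorted(PRESETS)}") from None
```

Returning a copy lets callers adjust a single value, for example raising num_tags, without altering the shared preset.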
Next Steps
- API Integration: Integrate with your applications using the REST API
- Features Deep Dive: Learn about advanced features like exclusion lists and multilingual support
- Deployment Guide: Deploy to production on AWS, GCP, or your own infrastructure
Example: Processing a Government Document
Here’s a complete example processing an Indian government document:
Congratulations! You’ve successfully processed your first document with the Meta-Data Tag Generator.