Transform Documents Into Searchable Knowledge
Meta-Data Tag Generator is an AI-powered system that automatically extracts meaningful metadata tags from documents, making them instantly searchable and discoverable. Whether you’re processing government reports, legal documents, or multilingual archives, our hybrid OCR approach ensures accurate text extraction and intelligent tag generation.Key Features
AI-Powered Tagging
Generate contextual metadata tags using OpenRouter API with support for multiple AI models including GPT-4, Gemini, and Claude
Hybrid OCR System
Three-tier extraction: PyPDF2 for digital PDFs, Tesseract for fast OCR, and EasyOCR for complex scripts with 80+ language support
Batch Processing
Process hundreds of documents with real-time WebSocket progress updates and intelligent rate limiting
Multilingual Support
Support for all Indian languages including Hindi, Tamil, Telugu, Bengali, Kannada, Malayalam, Marathi, and more
Smart Filtering
Use exclusion lists to filter out generic terms and ensure tags are specific and meaningful for search
Flexible Input
Process documents from file uploads, public URLs, CloudFront, S3, or batch CSV with automatic validation
How It Works
Upload Your Document
Upload a PDF file directly or provide a URL to a publicly accessible document. Supports CloudFront, S3, and standard HTTP/HTTPS URLs.
Configure AI Settings
Enter your OpenRouter API key, select your preferred AI model (GPT-4, Gemini, Claude, etc.), and set the number of tags to generate.
Optional: Add Exclusion List
Upload a text or PDF file containing terms to exclude from tag generation, ensuring tags are specific to your domain.
Process & Extract
The system automatically detects if your document is scanned and applies the optimal OCR method:
- Digital PDFs: Fast text extraction with PyPDF2
- Scanned PDFs (English/Hindi): Tesseract OCR for speed
- Complex Scripts: Automatic fallback to EasyOCR for accuracy
Generate Tags
AI analyzes the extracted text and generates contextual metadata tags categorized into:
- Names: Specific entities, programs, organizations
- Subjects: Topics, beneficiaries, domains
- Actions: Purpose, document type, context
Technical Architecture
The system uses a 3-tier extraction strategy to balance speed and accuracy. Digital PDFs are processed in under 2 seconds, while scanned documents may take 10-30 seconds depending on complexity.
Use Cases
Government Document Archives
Government Document Archives
Tag thousands of policy documents, circulars, and reports with metadata for easy search and retrieval. Automatically extracts scheme names, notification numbers, and ministry information.
Legal Document Management
Legal Document Management
Process legal documents with automatic extraction of act names, section numbers, and case references. Supports both English and regional language documents.
Research Paper Indexing
Research Paper Indexing
Generate keywords and metadata tags for academic papers, technical reports, and research publications. Supports multilingual content.
Digital Library Enhancement
Digital Library Enhancement
Enhance existing digital libraries with searchable metadata tags. Batch process entire collections with CSV import/export.
Supported Languages
Our hybrid OCR system supports 80+ languages including:Indian Languages
- Hindi (हिन्दी)
- Tamil (தமிழ்)
- Telugu (తెలుగు)
- Bengali (বাংলা)
- Kannada (ಕನ್ನಡ)
- Malayalam (മലയാളം)
- Marathi (मराठी)
- Gujarati (ગુજરાતી)
- Punjabi (ਪੰਜਾਬੀ)
- Odia (ଓଡ଼ିଆ)
International
- English
- Spanish
- French
- German
- Chinese
- Japanese
- Korean
- Arabic
- Russian
- And 60+ more
Mixed Language
Documents with multiple languages are automatically detected and processed correctly with language-aware tag generation.
Quick Start
Installation
Get started with Docker Compose in under 5 minutes
Quick Start Guide
Process your first document and understand the workflow
System Requirements
Minimum
- 2 CPU cores
- 4GB RAM
- 5GB disk space
- Docker & Docker Compose
Recommended
- 4+ CPU cores
- 8GB+ RAM
- 20GB disk space
- SSD storage for database
What’s Next?
Installation
Set up the system using Docker Compose
Quick Start
Process your first document in minutes
API Reference
Integrate with your applications