Skip to main content

Transform Documents Into Searchable Knowledge

Meta-Data Tag Generator is an AI-powered system that automatically extracts meaningful metadata tags from documents, making them instantly searchable and discoverable. Whether you’re processing government reports, legal documents, or multilingual archives, our hybrid OCR approach ensures accurate text extraction and intelligent tag generation.

Key Features

AI-Powered Tagging

Generate contextual metadata tags using OpenRouter API with support for multiple AI models including GPT-4, Gemini, and Claude

Hybrid OCR System

Three-tier extraction: PyPDF2 for digital PDFs, Tesseract for fast OCR, and EasyOCR for complex scripts with 80+ language support

Batch Processing

Process hundreds of documents with real-time WebSocket progress updates and intelligent rate limiting

Multilingual Support

Support for all Indian languages including Hindi, Tamil, Telugu, Bengali, Kannada, Malayalam, Marathi, and more

Smart Filtering

Use exclusion lists to filter out generic terms and ensure tags are specific and meaningful for search

Flexible Input

Process documents from file uploads, public URLs, CloudFront, S3, or batch CSV with automatic validation

How It Works

1

Upload Your Document

Upload a PDF file directly or provide a URL to a publicly accessible document. Supports CloudFront, S3, and standard HTTP/HTTPS URLs.
2

Configure AI Settings

Enter your OpenRouter API key, select your preferred AI model (GPT-4, Gemini, Claude, etc.), and set the number of tags to generate.
3

Optional: Add Exclusion List

Upload a text or PDF file containing terms to exclude from tag generation, ensuring tags are specific to your domain.
4

Process & Extract

The system automatically detects if your document is scanned and applies the optimal OCR method:
  • Digital PDFs: Fast text extraction with PyPDF2
  • Scanned PDFs (English/Hindi): Tesseract OCR for speed
  • Complex Scripts: Automatic fallback to EasyOCR for accuracy
5

Generate Tags

AI analyzes the extracted text and generates contextual metadata tags categorized into:
  • Names: Specific entities, programs, organizations
  • Subjects: Topics, beneficiaries, domains
  • Actions: Purpose, document type, context
6

Export & Use

Download results as CSV or JSON with complete metadata including extraction method, OCR confidence, and processing time.

Technical Architecture

The system uses a 3-tier extraction strategy to balance speed and accuracy. Digital PDFs are processed in under 2 seconds, while scanned documents may take 10-30 seconds depending on complexity.

Use Cases

Tag thousands of policy documents, circulars, and reports with metadata for easy search and retrieval. Automatically extracts scheme names, notification numbers, and ministry information.
Generate keywords and metadata tags for academic papers, technical reports, and research publications. Supports multilingual content.
Enhance existing digital libraries with searchable metadata tags. Batch process entire collections with CSV import/export.

Supported Languages

Our hybrid OCR system supports 80+ languages including:

Indian Languages

  • Hindi (हिन्दी)
  • Tamil (தமிழ்)
  • Telugu (తెలుగు)
  • Bengali (বাংলা)
  • Kannada (ಕನ್ನಡ)
  • Malayalam (മലയാളം)
  • Marathi (मराठी)
  • Gujarati (ગુજરાતી)
  • Punjabi (ਪੰਜਾਬੀ)
  • Odia (ଓଡ଼ିଆ)

International

  • English
  • Spanish
  • French
  • German
  • Chinese
  • Japanese
  • Korean
  • Arabic
  • Russian
  • And 60+ more

Mixed Language

Documents with multiple languages are automatically detected and processed correctly with language-aware tag generation.

Quick Start

Installation

Get started with Docker Compose in under 5 minutes

Quick Start Guide

Process your first document and understand the workflow

System Requirements

Minimum

  • 2 CPU cores
  • 4GB RAM
  • 5GB disk space
  • Docker & Docker Compose

Recommended

  • 4+ CPU cores
  • 8GB+ RAM
  • 20GB disk space
  • SSD storage for database
EasyOCR models require significant memory when processing complex scripts. For batch processing of scanned documents, we recommend at least 8GB RAM.

What’s Next?

Installation

Set up the system using Docker Compose

Quick Start

Process your first document in minutes

API Reference

Integrate with your applications

Build docs developers (and LLMs) love