Overview

This guide will walk you through processing your first document with the Meta-Data Tag Generator. You’ll learn how to:
  • Start the application
  • Configure AI settings
  • Process a single document
  • Understand the results
  • Try batch processing
This guide assumes you’ve already completed the installation. If not, work through the installation guide first.

Prerequisites

Before you begin, make sure you have:
  • An OpenRouter API key (free keys are available from OpenRouter)
  • Docker Compose installed and running
  • A PDF document to test with

Step 1: Start the Application

1. Navigate to the project directory

cd /path/to/Meta-Data-Tag-Generator/source
2. Start all services

docker-compose up -d
This starts:
  • Backend API (port 8000)
  • Frontend (port 3001)
  • PostgreSQL (port 5432)
  • MinIO (ports 9000, 9001)
  • Redis (port 6379)
3. Verify services are running

docker-compose ps
All services should show as “healthy” or “running”.
4. Check API health

curl http://localhost:8000/api/health
Expected response:
{
  "status": "healthy",
  "version": "2.0.0",
  "message": "Document Meta-Tagging API is running"
}
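Containers can take a few seconds to become ready after `docker-compose up -d`, so a single curl call may fail on a fresh start. The following is a minimal polling sketch, assuming only the endpoint and the `"status": "healthy"` field shown above; the function names, retry count, and delay are illustrative choices, not part of the project.

```python
import json
import time
import urllib.request
from typing import Callable

HEALTH_URL = "http://localhost:8000/api/health"  # endpoint from the curl check above


def fetch_health(url: str = HEALTH_URL) -> dict:
    """Fetch and decode the health payload (performs a real HTTP request)."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.load(response)


def wait_until_healthy(
    fetch: Callable[[], dict] = fetch_health,
    attempts: int = 10,           # illustrative default, not a project setting
    delay_seconds: float = 2.0,   # illustrative default, not a project setting
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the API reports "healthy" or the attempts run out."""
    for attempt in range(attempts):
        try:
            if fetch().get("status") == "healthy":
                return True
        except (OSError, ValueError):
            # Connection refused, timeout, or a non-JSON body: not ready yet.
            pass
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False
```

Calling `wait_until_healthy()` with no arguments polls the local API; the `fetch` and `sleep` parameters exist so the loop can be exercised without a running stack.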

Step 2: Access the Web Interface

Open your browser and navigate to:
http://localhost:3001
You should see the Meta-Data Tag Generator interface with two main sections:
  • Single Upload: Process one document at a time
  • Batch Processing: Process multiple documents from CSV

Step 3: Configure AI Settings

Before processing documents, you need to configure the AI tagging settings.
1. Locate the Configuration Panel

On the main page, find the “Configuration” section in the sidebar.
2. Enter your OpenRouter API Key

API Key: sk-or-v1-...
Don’t have an API key? Get one free at OpenRouter. The free tier allows only a limited number of requests; add credits for production use.
3. Select AI Model

Choose from available models:
Model                    | Speed   | Cost   | Best For
openai/gpt-4o-mini       | Fast    | Low    | General documents
google/gemini-flash-1.5  | Fastest | Lowest | High-volume processing
anthropic/claude-3-haiku | Fast    | Low    | Complex documents
openai/gpt-4o            | Slower  | Higher | Maximum quality
Recommended for getting started: openai/gpt-4o-mini offers the best balance of speed, cost, and quality.
4. Configure Processing Settings

  • Number of Pages to Extract: 3 (default, processes first 3 pages)
  • Number of Tags: 8 (default, generates 8 metadata tags)
Processing more pages can improve tag accuracy, but it also increases API cost and processing time. For most documents, 3 pages is sufficient.
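When the same settings are sent through the API rather than the web form, they travel as a small JSON object. The sketch below assembles one, assuming the field names used in this guide's curl example (`api_key`, `model_name`, `num_pages`, `num_tags`) and the defaults listed above. `build_config` and its validation rules are hypothetical helpers, not part of the project.

```python
import json


def build_config(
    api_key: str,
    model_name: str = "openai/gpt-4o-mini",  # recommended starting model
    num_pages: int = 3,                       # default from this guide
    num_tags: int = 8,                        # default from this guide
) -> str:
    """Return the settings as a JSON string, rejecting obviously bad values."""
    # Assumption: the prefix check mirrors the key format shown in this guide.
    if not api_key.startswith("sk-or-v1-"):
        raise ValueError("OpenRouter keys are expected to start with 'sk-or-v1-'")
    if num_pages < 1 or num_tags < 1:
        raise ValueError("num_pages and num_tags must be at least 1")
    return json.dumps(
        {
            "api_key": api_key,
            "model_name": model_name,
            "num_pages": num_pages,
            "num_tags": num_tags,
        }
    )
```

Validating locally is a design choice: it catches a mistyped key or a zero page count before any paid API request is made.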

Step 4: Process Your First Document

Now let’s process a document! You can use either a file upload or a URL.
1. Select a PDF file

Click “Choose File” or drag and drop a PDF into the upload area.
Maximum file size: 50MB
Supported format: PDF only
2. Preview the document

Once uploaded, you’ll see a PDF preview on the right side of the screen.
3. Click “Process Document”

The system will:
  1. Detect if the document is scanned or digital
  2. Extract text using the optimal method
  3. Send text to AI for tag generation
  4. Apply any exclusion filters
  5. Return structured results
4. View results

Processing typically takes 2-15 seconds depending on:
  • Document type (digital vs scanned)
  • Number of pages
  • AI model selected

Step 5: Understanding the Results

After processing completes, you’ll see detailed results:
{
  "success": true,
  "document_title": "Annual Report 2023-24",
  "tags": [
    "ministry social justice empowerment",
    "scheduled castes welfare",
    "annual report 2023 24",
    "national safai karamcharis finance",
    "scholarship scheme",
    "financial assistance",
    "budget allocation",
    "performance indicators"
  ],
  "extracted_text_preview": "Ministry of Social Justice and Empowerment\nAnnual Report 2023-24...",
  "processing_time": 3.45,
  "is_scanned": false,
  "extraction_method": "pypdf2",
  "ocr_confidence": null
}
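If you consume this response in your own code, it can help to load it into a typed structure first. The sketch below assumes only the fields shown in the sample response above; the `TaggingResult` class and the `summarize` helper are illustrative names, not part of the API.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TaggingResult:
    success: bool
    document_title: str
    tags: List[str]
    processing_time: float
    is_scanned: bool
    extraction_method: str
    ocr_confidence: Optional[float]  # null in the sample response above


def parse_result(raw: str) -> TaggingResult:
    """Decode a response body into a TaggingResult (extra fields are ignored)."""
    data = json.loads(raw)
    return TaggingResult(
        success=data["success"],
        document_title=data["document_title"],
        tags=list(data["tags"]),
        processing_time=float(data["processing_time"]),
        is_scanned=data["is_scanned"],
        extraction_method=data["extraction_method"],
        ocr_confidence=data.get("ocr_confidence"),
    )


def summarize(result: TaggingResult) -> str:
    """Build a one-line, human-readable summary of a result."""
    kind = "scanned" if result.is_scanned else "digital"
    return (
        f"{result.document_title}: {len(result.tags)} tags from a {kind} PDF "
        f"via {result.extraction_method} in {result.processing_time:.2f}s"
    )
```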

Result Fields Explained

tags (array)
Generated metadata tags in priority order:
  • Names: Specific entities (e.g., “ministry social justice empowerment”)
  • Subjects: Topics and domains (e.g., “scheduled castes welfare”)
  • Actions: Document purpose (e.g., “annual report 2023 24”)

extraction_method (string)
Method used for text extraction:
  • pypdf2: Digital PDF (fastest)
  • tesseract_ocr: Scanned PDF with Tesseract
  • easyocr: Complex scripts with EasyOCR

is_scanned (boolean)
Whether the document is scanned (requires OCR) or digital (text-based).

ocr_confidence (number or null)
OCR confidence score (0-100) when Tesseract is used; null in the digital-PDF example above, where no OCR was run. Scores below 60 trigger the EasyOCR fallback.

processing_time (number)
Total processing time in seconds, including text extraction and AI tag generation.
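The relationship between `is_scanned`, `ocr_confidence`, and `extraction_method` can be read as a small decision rule. The sketch below is one interpretation of it, not the project's actual code. It uses the threshold of 60 mentioned under Common Issues, and the function name is hypothetical.

```python
from typing import Optional

# Tesseract confidence below this value triggers the EasyOCR fallback
# (threshold taken from the troubleshooting notes in this guide).
EASYOCR_FALLBACK_THRESHOLD = 60.0


def choose_extraction_method(
    is_scanned: bool, tesseract_confidence: Optional[float] = None
) -> str:
    """Return the extraction_method value expected for a document."""
    if not is_scanned:
        return "pypdf2"  # digital PDF: read the embedded text directly
    if (
        tesseract_confidence is not None
        and tesseract_confidence < EASYOCR_FALLBACK_THRESHOLD
    ):
        return "easyocr"  # low-confidence OCR result: fall back
    return "tesseract_ocr"
```

For example, `choose_extraction_method(True, 42.0)` yields `"easyocr"`, while a scanned page read at confidence 85 stays with `"tesseract_ocr"`.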

Step 6: Using Exclusion Lists (Optional)

Exclusion lists help filter out generic terms that appear frequently across documents.
1. Create an exclusion list file

Create a text file with terms to exclude (one per line):
exclusion-list.txt
# Government organizations (comments start with #)
government-india
ministry-of-social-justice
government-of-india

# Generic document types
annual-report
newsletter
policy-document

# Overly generic terms
empowerment
welfare
scheme
2. Upload the exclusion file

In the configuration panel, click “Upload Exclusion List” and select your file.
Supported formats: .txt, .pdf
Format: One term per line or comma-separated
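To check a .txt exclusion file before uploading it, you can parse it locally. This is a minimal sketch of the format described above (one term per line or comma-separated, with `#` comment lines). Lowercasing and de-duplicating terms are assumptions made here for convenience, and the application's own parser may behave differently.

```python
from typing import List


def parse_exclusion_list(text: str) -> List[str]:
    """Return the exclusion terms in file order, skipping comments and blanks."""
    terms: List[str] = []
    seen = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        for term in line.split(","):  # also handles the comma-separated form
            term = term.strip().lower()  # assumption: case-insensitive terms
            if term and term not in seen:
                seen.add(term)
                terms.append(term)
    return terms
```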
3. Process with filtering

When you process a document, the system will:
  1. Instruct the AI to avoid excluded terms
  2. Filter any excluded terms that slip through
  3. Request additional tags to maintain the target count
If you request 8 tags and 2 are filtered, the system ensures you still get 8 final tags.
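The filter-then-top-up behaviour in steps 2 and 3 can be sketched as follows. This illustrates the idea only. The matching rule (exact match after treating hyphens as spaces), the `request_more` callback, and the round limit are all assumptions, and the real system may match terms differently.

```python
from typing import Callable, Iterable, List


def normalize(term: str) -> str:
    """Lowercase and treat hyphens like spaces (an assumption of this sketch)."""
    return " ".join(term.lower().replace("-", " ").split())


def filter_and_top_up(
    tags: Iterable[str],
    excluded: Iterable[str],
    target: int,
    request_more: Callable[[int], List[str]],  # hypothetical: asks the AI for n more tags
    max_rounds: int = 3,                        # illustrative safety limit
) -> List[str]:
    """Drop excluded tags, then request replacements until `target` is reached."""
    blocked = {normalize(term) for term in excluded}
    kept: List[str] = []
    seen = set()

    def absorb(candidates: Iterable[str]) -> None:
        for tag in candidates:
            key = normalize(tag)
            if key in blocked or key in seen:
                continue  # excluded term or duplicate
            seen.add(key)
            kept.append(tag)

    absorb(tags)
    rounds = 0
    while len(kept) < target and rounds < max_rounds:
        absorb(request_more(target - len(kept)))
        rounds += 1
    return kept[:target]
```

The round limit keeps the loop from running forever if the model keeps returning excluded or duplicate terms; in that case the result may contain fewer tags than requested.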

Step 7: Try Batch Processing

Process multiple documents at once using CSV input.
1. Navigate to Batch Processing

Click the “Batch Processing” tab in the navigation.
2. Download CSV Template

Click “Download Template” to get a sample CSV:
title,description,file_source_type,file_path,publishing_date,file_size
"Training Manual","PMSPECIAL training document",url,https://example.com/doc1.pdf,2025-01-15,1.2MB
"Annual Report 2023","Ministry annual report",url,https://example.com/doc2.pdf,2023-12-31,2.5MB
3. Prepare your CSV

Required columns:
  • title: Document title
  • file_source_type: url, s3, or local
  • file_path: URL or path to the PDF
Optional columns:
  • description: Document description
  • publishing_date: Publication date
  • file_size: File size
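Before uploading, you can check a CSV against the column rules above. The following sketch assumes only the required columns and the three `file_source_type` values listed here; the function name and the message wording are illustrative.

```python
import csv
import io
from typing import List

REQUIRED_COLUMNS = ("title", "file_source_type", "file_path")
ALLOWED_SOURCE_TYPES = {"url", "s3", "local"}


def validate_batch_csv(text: str) -> List[str]:
    """Return a list of problems found in the CSV text (empty if none)."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    missing = [column for column in REQUIRED_COLUMNS if column not in header]
    if missing:
        return [f"missing required column(s): {', '.join(missing)}"]

    problems: List[str] = []
    for row_number, row in enumerate(reader, start=2):  # the header is row 1
        for column in REQUIRED_COLUMNS:
            if not (row[column] or "").strip():
                problems.append(f"row {row_number}: '{column}' is empty")
        source_type = (row["file_source_type"] or "").strip()
        if source_type and source_type not in ALLOWED_SOURCE_TYPES:
            problems.append(
                f"row {row_number}: unknown file_source_type '{source_type}'"
            )
    return problems
```

Passing the contents of the downloaded template should return an empty list, since every row there has a title, a source type of `url`, and a file path.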
4. Upload and process

  1. Upload your CSV file
  2. Map columns if needed
  3. Click “Start Processing”
  4. Watch real-time progress via WebSocket updates
Batch processing includes intelligent rate limiting to avoid API throttling. Large batches are processed sequentially with exponential backoff.
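Exponential backoff means waiting twice as long after each throttled attempt before retrying. The sketch below shows the general technique, not the application's internal implementation. `RateLimitedError`, the delay values, and the attempt limit are hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitedError(Exception):
    """Hypothetical error raised by `operation` when the API throttles a request."""


def call_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,     # illustrative default
    base_delay: float = 1.0,   # seconds before the first retry
    max_delay: float = 30.0,   # cap so that waits do not grow without bound
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, doubling the wait after each rate-limit error."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except RateLimitedError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(min(max_delay, base_delay * 2 ** attempt))
    raise ValueError("max_attempts must be at least 1")
```

With the defaults, the waits between attempts are 1, 2, 4, and 8 seconds. Production implementations often add random jitter to each delay so that parallel clients do not retry in lockstep.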
5. Export results

Download results as CSV with all metadata:
  • Original document info
  • Generated tags
  • Extraction method
  • Processing time
  • Any errors

Common Issues & Solutions

Error: Invalid API key. Please check your OpenRouter API key.
Solution:
  • Verify that your API key is correct (it starts with sk-or-v1-)
  • Check that your API key has available credits
  • Regenerate the key from the OpenRouter Keys page if needed

Error: RATE_LIMITED: OpenRouter free tier limit hit
Solution:
  • The free tier has strict rate limits
  • Add credits to your OpenRouter account on the Billing page
  • For batch processing, the system automatically adds delays
  • Reduce concurrent requests

Issue: Extracted text contains nonsensical characters
Solution:
  • The document may be very low quality; try rescanning at a higher DPI
  • For complex Indian scripts, the system automatically falls back to EasyOCR
  • Check the OCR confidence score; values below 60 trigger EasyOCR

Issue: Processing takes too long or times out
Solution:
  • Reduce num_pages to 1-2 for faster processing
  • Large scanned PDFs take longer (10-30 seconds)
  • EasyOCR downloads models on first use (one-time delay)
  • Check Docker resource limits (increase RAM to 8 GB for OCR)

Issue: Real-time progress not showing in batch processing
Solution:
  • Check that Redis is running: docker-compose ps redis
  • Verify that the WebSocket endpoint is accessible
  • Check the browser console for connection errors
  • Ensure that no firewall is blocking WebSocket connections

Performance Tips

For Speed

  • Use google/gemini-flash-1.5 model
  • Process only first 1-2 pages
  • Use digital PDFs when possible
  • Reduce number of tags to 5

For Quality

  • Use openai/gpt-4o or anthropic/claude-3-opus
  • Process 3-5 pages
  • Use exclusion lists to filter noise
  • Request 10-12 tags for more options

For Cost

  • Use google/gemini-flash-1.5 (lowest cost)
  • Process 1-2 pages only
  • Batch process to amortize overhead
  • Use smaller num_tags values

For Accuracy

  • Ensure documents are high quality
  • For scanned docs, use 300+ DPI
  • Include document descriptions
  • Use language-specific models if available
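The speed, quality, and cost tips can be captured as named presets for scripted use. The sketch below is illustrative only. Where the tips give a range (for example 1-2 pages, or "smaller num_tags values"), the specific number is a choice made here rather than a recommendation from the project, and `settings_for` is a hypothetical helper.

```python
from typing import Any, Dict

# Each preset picks one concrete value from the ranges in the tips above.
PRESETS: Dict[str, Dict[str, Any]] = {
    "speed": {"model_name": "google/gemini-flash-1.5", "num_pages": 2, "num_tags": 5},
    "quality": {"model_name": "openai/gpt-4o", "num_pages": 5, "num_tags": 12},
    "cost": {"model_name": "google/gemini-flash-1.5", "num_pages": 1, "num_tags": 5},
}


def settings_for(goal: str, **overrides: Any) -> Dict[str, Any]:
    """Return a copy of the preset for `goal`, with any overrides applied."""
    if goal not in PRESETS:
        raise KeyError(f"unknown goal: {goal!r}; expected one of {sorted(PRESETS)}")
    return {**PRESETS[goal], **overrides}
```

For instance, `settings_for("quality", num_pages=3)` keeps the higher-quality model but trims the page count.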

Next Steps

API Integration

Integrate with your applications using the REST API

Features Deep Dive

Learn about advanced features like exclusion lists and multilingual support

Deployment Guide

Deploy to production on AWS, GCP, or your own infrastructure

Example: Processing a Government Document

Here’s a complete example processing an Indian government document:
curl -X POST http://localhost:8000/api/single/process \
  -H "Authorization: Bearer YOUR_JWT_TOKEN" \
  -F "pdf_url=https://socialjustice.gov.in/writereaddata/UploadFile/AnnualReport2023.pdf" \
  -F 'config={"api_key":"sk-or-v1-...","model_name":"openai/gpt-4o-mini","num_pages":3,"num_tags":8}'
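The same request can be prepared from Python using only the standard library. This is a sketch under the assumption that the endpoint accepts the same two multipart form fields as the curl command above (`pdf_url` and `config`). The helper names are hypothetical, and the code builds the request without sending it.

```python
import json
import urllib.request
import uuid
from typing import Dict, Optional, Tuple

PROCESS_URL = "http://localhost:8000/api/single/process"  # from the curl example


def encode_form(fields: Dict[str, str], boundary: Optional[str] = None) -> Tuple[bytes, str]:
    """Encode text fields as multipart/form-data, the format curl's -F sends."""
    boundary = boundary or uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines += [
            f"--{boundary}",
            f'Content-Disposition: form-data; name="{name}"',
            "",
            value,
        ]
    lines += [f"--{boundary}--", ""]
    body = "\r\n".join(lines).encode("utf-8")
    return body, f"multipart/form-data; boundary={boundary}"


def build_request(pdf_url: str, config: dict, jwt_token: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for single-document processing."""
    body, content_type = encode_form({"pdf_url": pdf_url, "config": json.dumps(config)})
    return urllib.request.Request(
        PROCESS_URL,
        data=body,
        method="POST",
        headers={"Authorization": f"Bearer {jwt_token}", "Content-Type": content_type},
    )
```

To send it, pass the result to `urllib.request.urlopen(...)` and decode the response body with `json.load`. Keeping construction separate from sending makes the payload easy to inspect first.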
Congratulations! You’ve successfully processed your first document with the Meta-Data Tag Generator.
