Architecture Overview
Meta-Data Tag Generator is a microservices-based document processing system that extracts text from PDFs, analyzes content using AI, and generates relevant metadata tags.
System Diagram
Service Details
Frontend Service (Next.js)
Technology: Next.js 14+ with React, TypeScript
Responsibilities:
- User interface for document upload and management
- Display generated tags and metadata
- Tag editing and categorization
- User authentication flows
- Real-time job status updates
Container Configuration:
- Multi-stage Docker build for minimal image size
- Runs as non-root user (nextjs:nodejs)
- Standalone output mode for optimal performance
- Health check endpoint on port 3000
Communication:
- Connects to backend via NEXT_PUBLIC_BACKEND_URL
- Client-side API calls to backend REST endpoints
- WebSocket/SSE for real-time updates (if implemented)
Backend Service (FastAPI)
Technology: Python 3.8, FastAPI, Uvicorn
Responsibilities:
- RESTful API for all operations
- PDF text extraction (PyPDF2, pdfplumber)
- OCR processing:
  - Tesseract OCR (English, Hindi)
  - EasyOCR (multi-language support)
- AI-powered tag generation via OpenRouter
- User authentication (JWT)
- Job queue management
- File storage orchestration
Resource Limits:
- Memory: 1GB minimum, 4GB maximum
- Required for EasyOCR model loading and inference
- CPU-only PyTorch for smaller footprint
Dependencies:
- PostgreSQL: User data, document metadata, job status
- MinIO: File storage (upload/download)
- Redis: Job state, caching, pub/sub
Startup Sequence:
- Wait for PostgreSQL health check (10s intervals)
- Wait for MinIO health check (30s intervals)
- Wait for Redis health check (10s intervals)
- Load EasyOCR models (cached in volume)
- Start Uvicorn server on port 8000
- Health check available at /api/health
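The wait-for-health-check steps above amount to polling each dependency until it reports healthy. The following is a minimal sketch of that idea, not the project's actual startup code: `wait_until_healthy` and its parameters are hypothetical names, and the real checks (for example `pg_isready` or `redis-cli ping`) would be supplied as the `check` callable.

```python
import time
from typing import Callable


def wait_until_healthy(
    name: str,
    check: Callable[[], bool],
    interval: float,
    timeout: float,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll `check` every `interval` seconds until it returns True.

    Returns the number of attempts made. Raises TimeoutError once another
    wait would exceed `timeout` seconds of total waiting.
    """
    attempts = 0
    waited = 0.0
    while True:
        attempts += 1
        if check():
            return attempts
        if waited + interval > timeout:
            raise TimeoutError(f"{name} not healthy after {waited:.0f}s")
        sleep(interval)
        waited += interval
```

With the intervals listed above, PostgreSQL and Redis would be polled with `interval=10` and MinIO with `interval=30`. The `sleep` parameter is injectable only so the helper can be exercised without real delays.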
PostgreSQL Database
Technology: PostgreSQL 15 Alpine
Responsibilities:
- Persistent storage for all structured data
- User authentication data
- Document metadata and relationships
- Tag definitions and associations
- Processing job history
Tables:
- users: Authentication and profile data
- documents: PDF metadata (filename, size, upload date)
- tags: Generated tags with confidence scores
- jobs: Processing job status and results
- categories: User-defined tag categories
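To make the relationships between these records concrete, here is an illustrative sketch of two of them as Python dataclasses. Only the attributes named above (filename, size, upload date, confidence score) come from this document. The field names, the ID columns, and the assumption that confidence lies between 0 and 1 are hypothetical and may not match the real schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Document:
    id: int
    user_id: int            # owner; refers to a row in users (assumed)
    filename: str
    size_bytes: int
    uploaded_at: datetime


@dataclass
class Tag:
    id: int
    document_id: int        # refers to a row in documents (assumed)
    name: str
    confidence: float       # assumed to lie in [0.0, 1.0]
    category_id: Optional[int] = None  # refers to categories, if assigned

    def __post_init__(self) -> None:
        # Reject scores outside the assumed range early.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
```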
MinIO Object Storage
Technology: MinIO (S3-compatible)
Responsibilities:
- Store uploaded PDF files
- Store OCR-extracted text files
- Store processing results and artifacts
- Serve files via presigned URLs
Buckets:
- metatag-files: All document storage
  - uploads/: Original PDFs
  - ocr-results/: Extracted text
  - exports/: Generated outputs
Ports:
- API (Port 9000): Backend service access
- Console (Port 9001): Web UI for management
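The three prefixes inside the metatag-files bucket suggest a simple key-naming scheme. The sketch below shows one way object keys could be built. The `object_key` helper and the per-document subfolder are assumptions for illustration, not the layout the backend necessarily uses.

```python
from pathlib import PurePosixPath

BUCKET = "metatag-files"
PREFIXES = {"upload": "uploads", "ocr": "ocr-results", "export": "exports"}


def object_key(kind: str, document_id: str, filename: str) -> str:
    """Build an object key under one of the bucket's three prefixes."""
    if kind not in PREFIXES:
        raise ValueError(f"unknown object kind: {kind!r}")
    # Keep only the final path component so a client-supplied name
    # such as "../../etc/passwd" cannot leave the intended prefix.
    safe_name = PurePosixPath(filename.replace("\\", "/")).name
    if safe_name in ("", ".."):
        raise ValueError("filename must name a file")
    return f"{PREFIXES[kind]}/{document_id}/{safe_name}"
```

Keeping key construction in one function means uploads, OCR output, and exports for the same document differ only in their prefix.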
Redis
Technology: Redis 7 Alpine
Responsibilities:
- Job queue state management
- Caching AI responses (optional)
- Pub/sub for real-time updates
- Session storage (if needed)
Use Cases:
- Job Queue: Track processing status
- Rate Limiting: Prevent API abuse
- Caching: Store frequently accessed data
- Pub/Sub: Real-time notifications to frontend
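Rate limiting is the least self-explanatory of these uses, so here is a sketch of a fixed-window counter, the pattern commonly built on Redis with INCR and EXPIRE. This version is an in-memory stand-in so it runs without a Redis server. The class name, limit, and window are hypothetical, and this document does not say which algorithm the backend uses.

```python
import time
from typing import Callable, Dict, Tuple


class FixedWindowLimiter:
    """In-memory stand-in for a Redis INCR + EXPIRE rate limiter."""

    def __init__(self, limit: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        # key -> (window start time, request count in that window)
        self._counters: Dict[str, Tuple[float, int]] = {}

    def allow(self, key: str) -> bool:
        """Return True if this request for `key` is within the limit."""
        now = self.clock()
        start, count = self._counters.get(key, (now, 0))
        if now - start >= self.window:
            # Window expired: start a fresh one (Redis would expire the key).
            start, count = now, 0
        count += 1
        self._counters[key] = (start, count)
        return count <= self.limit
```

A fixed window is simple but allows short bursts at window boundaries. A sliding window or token bucket trades more bookkeeping for smoother limits.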
Nginx Reverse Proxy
Technology: Nginx Alpine (Production only)
Responsibilities:
- SSL/TLS termination
- Load balancing (if scaled)
- Static file serving
- Request routing
- Security headers
Nginx is enabled only under the production profile.
Data Flow
Document Upload Flow
Tag Generation Flow
Authentication Flow
Network Architecture
Docker Network
All services communicate over the meta-tag-network bridge network.
Internal Hostnames:
- postgres:5432 - Database
- minio:9000 - Object storage
- redis:6379 - Cache
- backend:8000 - API server
- frontend:3000 - Web app
Host Port Mappings:
- localhost:5432 → PostgreSQL
- localhost:9000/9001 → MinIO
- localhost:6379 → Redis
- localhost:8000 → Backend API
- localhost:3001 → Frontend
- localhost:80/443 → Nginx (production)
Service Dependencies
- PostgreSQL, MinIO, Redis (parallel)
- Wait for all health checks to pass
- Backend starts (waits 40s for OCR models)
- Frontend starts
- Nginx starts (production only)
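The startup order above is a topological ordering of a dependency graph. The sketch below reproduces it with the standard library's `graphlib` module (Python 3.9+, so it is a standalone illustration rather than backend code). The edges from frontend to backend and from nginx to frontend are assumptions inferred from the list order, and the service names follow the internal hostnames.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each service maps to the set of services it waits for.
DEPENDS_ON = {
    "postgres": set(),
    "minio": set(),
    "redis": set(),
    "backend": {"postgres", "minio", "redis"},
    "frontend": {"backend"},   # assumed from the list order
    "nginx": {"frontend"},     # production only; assumed edge
}


def startup_waves(graph):
    """Group services into waves whose members can start in parallel."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

Calling `startup_waves(DEPENDS_ON)` yields the three data stores as the first wave, followed by backend, frontend, and nginx in separate waves.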
Storage Architecture
Persistent Volumes
File Organization in MinIO
Scalability Considerations
Horizontal Scaling
Stateless Services (can scale):
- ✅ Backend (FastAPI)
- ✅ Frontend (Next.js)
- ✅ Nginx (load balancer)
Stateful Services (scale with care):
- ⚠️ PostgreSQL (use managed service or replication)
- ⚠️ MinIO (supports distributed mode)
- ⚠️ Redis (use Redis Cluster for HA)
Resource Requirements
Minimum (Development):
- 4GB RAM
- 2 CPU cores
- 10GB disk space
Recommended (Production):
- 8GB+ RAM
- 4+ CPU cores
- 50GB+ disk space (depends on file storage)
Approximate Memory per Service:
- PostgreSQL: ~200MB
- MinIO: ~100MB
- Redis: ~50MB
- Backend: 1-4GB (OCR models)
- Frontend: ~100MB
- Nginx: ~10MB
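Adding up the per-service figures above shows where the 4GB and 8GB+ totals come from. This is a worked calculation only. It treats 1GB as 1024MB and the approximate values as exact, both of which are simplifying assumptions.

```python
# Approximate per-service memory in MB, taken from the list above.
FIXED_MB = {"postgres": 200, "minio": 100, "redis": 50, "frontend": 100, "nginx": 10}
BACKEND_MB = (1024, 4096)  # backend range of 1-4GB, driven by OCR models


def total_memory_mb():
    """Return (low, high) estimates of total container memory in MB."""
    fixed = sum(FIXED_MB.values())  # 460 MB for everything except the backend
    return fixed + BACKEND_MB[0], fixed + BACKEND_MB[1]


low, high = total_memory_mb()
print(low, high)  # 1484 4556
```

The containers alone need roughly 1.4GB to 4.4GB, so a 4GB host covers light use with little headroom, while 8GB or more leaves room for the backend at its 4GB ceiling plus the operating system.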
Performance Optimization
Backend Optimization:
- Cache EasyOCR models in volume
- Use CPU-optimized PyTorch
- Async processing with background tasks
- Connection pooling for PostgreSQL
Frontend Optimization:
- Multi-stage Docker build
- Standalone output (minimal dependencies)
- Static asset compression
- CDN for static files (production)
Database Optimization:
- Indexed columns (user_id, document_id)
- Connection pooling (SQLAlchemy)
- Regular VACUUM and ANALYZE
Security Architecture
Authentication & Authorization
- User logs in with credentials
- Backend generates JWT (HS256)
- Frontend stores token
- Token included in Authorization: Bearer <token>
- Backend validates signature and expiry
- Extract user_id from token claims
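The sign-then-verify steps above can be illustrated with the standard library alone. This is a sketch of how an HS256 token is built and checked, not the backend's implementation; a production service would normally rely on a maintained JWT library. The function names are hypothetical, and only the `user_id` claim, the HS256 algorithm, and the expiry check come from this document.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input: str, secret: bytes) -> bytes:
    return hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()


def issue_token(user_id: int, secret: bytes, ttl_seconds: int,
                now: Optional[float] = None) -> str:
    """Create a compact JWT carrying user_id and an expiry (exp) claim."""
    now = time.time() if now is None else now
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {"user_id": user_id, "exp": int(now) + ttl_seconds}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    return f"{signing_input}.{_b64url(_sign(signing_input, secret))}"


def verify_token(token: str, secret: bytes, now: Optional[float] = None) -> int:
    """Return user_id if signature and expiry are valid, else raise ValueError."""
    now = time.time() if now is None else now
    try:
        header_b64, claims_b64, sig_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        claims = json.loads(_b64url_decode(claims_b64))
        signature = _b64url_decode(sig_b64)
    except (ValueError, TypeError) as exc:
        raise ValueError("malformed token") from exc
    if not isinstance(header, dict) or header.get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    expected = _sign(f"{header_b64}.{claims_b64}", secret)
    if not hmac.compare_digest(signature, expected):  # constant-time compare
        raise ValueError("bad signature")
    if now >= claims["exp"]:
        raise ValueError("token expired")
    return claims["user_id"]
```

The algorithm is checked before the signature so a token claiming a different `alg` is rejected outright, and `hmac.compare_digest` avoids leaking timing information during comparison.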
Network Security
Internal Network:
- All services on isolated bridge network
- No direct external access except via exposed ports
Exposed Ports (Development):
- PostgreSQL: 5432 (disable in production)
- MinIO: 9000, 9001 (restrict in production)
- Redis: 6379 (disable in production)
- Backend: 8000
- Frontend: 3001
Production Recommendations:
- Only expose ports 80/443 via Nginx
- Use environment variables for secrets
- Enable HTTPS/TLS
- Change default passwords
- Use strong JWT secret
Data Security
At Rest:
- PostgreSQL: Transparent data encryption (TDE) optional
- MinIO: Server-side encryption (SSE) optional
- Redis: AOF encrypted via disk encryption
In Transit:
- HTTPS via Nginx (production)
- Internal network encrypted via Docker overlay (optional)
- MinIO secure mode (MINIO_SECURE=true)
Monitoring & Health Checks
Health Check Endpoints
| Service | Endpoint | Interval | Timeout |
|---|---|---|---|
| Backend | http://localhost:8000/api/health | 30s | 10s |
| Frontend | http://localhost:3000 | 30s | 10s |
| PostgreSQL | pg_isready -U metatag | 10s | 5s |
| MinIO | http://localhost:9000/minio/health/live | 30s | 10s |
| Redis | redis-cli ping | 10s | 5s |
Service Status
Metrics to Monitor
System Metrics:
- CPU usage per container
- Memory usage (especially backend)
- Disk I/O (PostgreSQL, MinIO)
- Network traffic
Application Metrics:
- Request rate (requests/second)
- Response time (p50, p95, p99)
- Error rate (4xx, 5xx)
- Job queue depth (Redis)
- Processing time per document
Business Metrics:
- Documents processed per day
- Average tags per document
- User activity (uploads, searches)
- Storage usage (MinIO)
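The response-time percentiles listed among the metrics above (p50, p95, p99) can be computed from raw latency samples. The sketch below uses the nearest-rank definition. Monitoring tools often interpolate or use histogram buckets instead, so their values can differ slightly, and the `percentile` helper is a hypothetical name.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


latencies_ms = [120, 80, 100, 450, 90]
print(percentile(latencies_ms, 50))  # 100
print(percentile(latencies_ms, 95))  # 450
```

The gap between p50 and p95 in this small sample shows why averages alone are misleading: one slow request dominates the tail without moving the median.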
Disaster Recovery
Backup Strategy
PostgreSQL:
Recovery Procedures
Complete System Recovery:
Next Steps
- Review Docker Compose deployment
- Configure environment variables
- Set up monitoring and alerting
- Configure automated backups
- Review security hardening guide