System Architecture Overview
SIAA (Sistema Inteligente de Apoyo Administrativo) is an intelligent judicial document management system built for the Seccional Bucaramanga of Colombia’s Judicial Branch. It uses AI-powered document routing and retrieval to answer queries about judicial procedures, regulations, and administrative processes.

System Components
Component Details
Flask Proxy Server
The proxy server (siaa_proxy.py) acts as the central orchestrator, handling:
- Request routing and validation
- Cache management
- Document retrieval coordination
- Ollama API communication
- Quality monitoring and logging
The proxy runs on the Waitress WSGI server with 16 threads (HILOS_SERVIDOR=16) for production deployment.

Ollama LLM Engine
SIAA uses the Qwen2.5:3b model via Ollama’s local API.
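A minimal sketch of how the proxy might call Ollama, using only the standard library. The endpoint and payload fields (`model`, `prompt`, `stream`) follow Ollama’s documented `/api/generate` API; the function names (`construir_payload`, `consultar_ollama`) are illustrative, not taken from the actual source.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"   # from the configuration table below
MODEL = "qwen2.5:3b"
TIMEOUT_RESPUESTA = 180                 # response timeout, seconds

def construir_payload(prompt: str, stream: bool = True) -> dict:
    """Build the request body for Ollama's /api/generate endpoint."""
    return {"model": MODEL, "prompt": prompt, "stream": stream}

def consultar_ollama(prompt: str) -> str:
    """Send a non-streaming generation request and return the response text."""
    datos = json.dumps(construir_payload(prompt, stream=False)).encode("utf-8")
    peticion = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=datos,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(peticion, timeout=TIMEOUT_RESPUESTA) as resp:
        return json.loads(resp.read())["response"]
```

In production the proxy streams tokens (`stream: true`) so the browser can render the answer as it is generated; the non-streaming form above is the simplest round trip.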
Document Store
Documents are loaded from /opt/siaa/fuentes at startup.
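A sketch of what startup loading could look like, assuming documents are plain text files read into memory; the function name `cargar_documentos` is hypothetical.

```python
from pathlib import Path

CARPETA_FUENTES = "/opt/siaa/fuentes"   # document source directory (config table)

def cargar_documentos(carpeta: str = CARPETA_FUENTES) -> dict:
    """Read every file in the source folder into a {name: content} map."""
    documentos = {}
    for ruta in sorted(Path(carpeta).glob("*")):
        if ruta.is_file():
            documentos[ruta.name] = ruta.read_text(encoding="utf-8", errors="replace")
    return documentos
```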
LRU Cache System
High-performance response cache with thread-safe LRU eviction.
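The combination of LRU eviction, TTL expiry, and a lock can be sketched with an `OrderedDict`; the class and method names are illustrative, but the capacity and TTL values come from the configuration table below.

```python
import threading
import time
from collections import OrderedDict

CACHE_MAX_ENTRADAS = 200     # cache capacity (config table)
CACHE_TTL_SEGUNDOS = 3600    # entry lifetime, seconds

class CacheLRU:
    """Thread-safe LRU cache with TTL-based expiry."""

    def __init__(self, capacidad=CACHE_MAX_ENTRADAS, ttl=CACHE_TTL_SEGUNDOS):
        self._datos = OrderedDict()          # clave -> (timestamp, valor)
        self._capacidad = capacidad
        self._ttl = ttl
        self._lock = threading.Lock()

    def obtener(self, clave):
        with self._lock:
            entrada = self._datos.get(clave)
            if entrada is None:
                return None
            creado, valor = entrada
            if time.time() - creado > self._ttl:   # entry expired
                del self._datos[clave]
                return None
            self._datos.move_to_end(clave)         # mark as most recently used
            return valor

    def guardar(self, clave, valor):
        with self._lock:
            self._datos[clave] = (time.time(), valor)
            self._datos.move_to_end(clave)
            if len(self._datos) > self._capacidad: # evict least recently used
                self._datos.popitem(last=False)
```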
Data Flow: Query to Response
Why Limit to 2 Concurrent Requests?
- RAM constraints: Qwen2.5:3b requires ~4GB per instance
- CPU bottleneck: Ryzen 5 2600 (6 cores) thrashes with >2 parallel inferences
- Response quality: More concurrency = slower per-token generation
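A limit like this is typically enforced with a counting semaphore in front of the inference call; a minimal sketch, with illustrative names (`semaforo_ollama`, `con_limite`):

```python
import threading

MAX_OLLAMA_SIMULTANEOS = 2   # concurrent Ollama requests (config table)

# At most two requests reach Ollama at once; the rest block in the proxy
# until a slot frees up, instead of thrashing the CPU.
semaforo_ollama = threading.BoundedSemaphore(MAX_OLLAMA_SIMULTANEOS)

def con_limite(funcion, *args, **kwargs):
    """Run an inference call under the concurrency limit."""
    with semaforo_ollama:
        return funcion(*args, **kwargs)
```

Note that the 16 Waitress threads still accept connections freely; only the inference step is gated, so cache hits and status checks stay fast.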
Health Monitoring System
Automatic health checks run every 15 seconds.
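A background daemon thread polling Ollama is one plausible shape for this; the names (`chequear_ollama`, `monitor_salud`, `INTERVALO_CHEQUEO`) are assumptions. Ollama’s root endpoint answers plain HTTP when the server is up, which makes it a cheap liveness probe.

```python
import threading
import time
import urllib.request

OLLAMA_URL = "http://localhost:11434"
INTERVALO_CHEQUEO = 15   # seconds between checks (assumed name)

ollama_disponible = False

def chequear_ollama(url: str = OLLAMA_URL, timeout: float = 8) -> bool:
    """Return True if the Ollama API answers on its root endpoint."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def monitor_salud():
    """Daemon loop: refresh the availability flag every 15 seconds."""
    global ollama_disponible
    while True:
        ollama_disponible = chequear_ollama()
        time.sleep(INTERVALO_CHEQUEO)

# Started once at boot:
# threading.Thread(target=monitor_salud, daemon=True).start()
```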
Warm-up Process
On first successful connection, the monitor sends a minimal query ("ok" with 1 token prediction) to:
- Load the model into RAM (prevents 30s delay on first real query)
- Initialize CUDA/ROCm context
- Verify model availability
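The warm-up request described above can be expressed with Ollama’s `options.num_predict` parameter, which caps the number of generated tokens; the function names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"
MODEL = "qwen2.5:3b"

def payload_calentamiento() -> dict:
    """Minimal query: one predicted token is enough to load the model into RAM."""
    return {
        "model": MODEL,
        "prompt": "ok",
        "stream": False,
        "options": {"num_predict": 1},   # Ollama option: cap generated tokens
    }

def calentar_modelo():
    datos = json.dumps(payload_calentamiento()).encode("utf-8")
    peticion = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=datos,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(peticion, timeout=60).read()
```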
Check System Status
The /siaa/status endpoint reports warmup_completado, usuarios_activos, cache statistics, and Ollama availability.

Quality Monitoring and Logging
Every query is logged to /opt/siaa/logs/calidad.jsonl (JSONL format for easy analysis).
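JSONL logging means appending one JSON object per line; a minimal sketch, where the record fields (`ts`, `pregunta`, `respuesta`, `etiqueta`) and the function name are assumptions, not the actual schema.

```python
import json
import time

LOG_ARCHIVO = "/opt/siaa/logs/calidad.jsonl"   # quality log path (config table)

def registrar_consulta(pregunta, respuesta, etiqueta=None, ruta=LOG_ARCHIVO):
    """Append one query record as a single JSON line (JSONL)."""
    registro = {
        "ts": time.time(),
        "pregunta": pregunta,
        "respuesta": respuesta,
        "etiqueta": etiqueta,   # e.g. POSIBLE_ALUCINACION / SIN_CONTEXTO, or None
    }
    with open(ruta, "a", encoding="utf-8") as f:
        f.write(json.dumps(registro, ensure_ascii=False) + "\n")
```

One line per record is what makes the format easy to analyze: `grep`, `jq`, or a few lines of Python can filter it without parsing the whole file.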
Hallucination Detection
The system automatically flags potential hallucinations:
- POSIBLE_ALUCINACION: the model said “no encontré” (“I didn’t find”) despite receiving relevant documents
- SIN_CONTEXTO: No documents found (expected “no encontré”)
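The two rules above amount to a small classification step over each answer; a sketch with an illustrative function name, assuming a simple substring check on the model’s output:

```python
from typing import Optional

def clasificar_respuesta(respuesta: str, num_documentos: int) -> Optional[str]:
    """Flag suspicious answers according to the two rules above."""
    if num_documentos == 0:
        # No context was retrieved, so "no encontré" is the expected answer.
        return "SIN_CONTEXTO"
    if "no encontré" in respuesta.lower():
        # Relevant documents were supplied, yet the model claims it found nothing.
        return "POSIBLE_ALUCINACION"
    return None   # nothing suspicious
```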
View Quality Logs
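Because the log is JSONL, a few lines of Python are enough to summarize it; this sketch assumes each record carries an `etiqueta` field holding the quality flag (a hypothetical field name).

```python
import json
from collections import Counter

def resumen_calidad(ruta="/opt/siaa/logs/calidad.jsonl"):
    """Count quality flags across all logged queries."""
    contador = Counter()
    with open(ruta, encoding="utf-8") as f:
        for linea in f:
            registro = json.loads(linea)
            contador[registro.get("etiqueta") or "OK"] += 1
    return dict(contador)
```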
Configuration Reference
| Parameter | Value | Purpose |
|---|---|---|
| OLLAMA_URL | http://localhost:11434 | Ollama API endpoint |
| MODEL | qwen2.5:3b | LLM model identifier |
| MAX_OLLAMA_SIMULTANEOS | 2 | Concurrent Ollama requests |
| HILOS_SERVIDOR | 16 | Waitress worker threads |
| TIMEOUT_CONEXION | 8 | Connection timeout (seconds) |
| TIMEOUT_RESPUESTA | 180 | Response timeout (seconds) |
| CARPETA_FUENTES | /opt/siaa/fuentes | Document source directory |
| MAX_DOCS_CONTEXTO | 2 | Max documents per query |
| CHUNK_SIZE | 800 | Characters per chunk |
| CHUNK_OVERLAP | 300 | Overlap between chunks (characters) |
| MAX_CHUNKS_CONTEXTO | 3 | Max chunks per document |
| CACHE_MAX_ENTRADAS | 200 | Cache capacity (entries) |
| CACHE_TTL_SEGUNDOS | 3600 | Cache entry lifetime (seconds) |
| LOG_ARCHIVO | /opt/siaa/logs/calidad.jsonl | Quality log path |
| LOG_MAX_LINEAS | 5000 | Log rotation threshold (lines) |
Performance Characteristics
- Cache hit: ~5ms response time
- Cache miss: 20-45s response time (depending on context size)
- TTFT (Time To First Token): 3-8s with warm model
- Token generation: ~15-20 tokens/second
- Max throughput: 2 concurrent users (semaphore limit)
- Cache hit rate: 30-40% (across 26 court offices)
API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| /siaa/chat | POST | Main chat interface (SSE streaming) |
| /siaa/status | GET | System health and statistics |
| /siaa/ver/&lt;doc&gt; | GET | View document as HTML |
| /siaa/log | GET | Quality monitoring log |
| /siaa/cache | GET/DELETE | Cache statistics / clear cache |
| /siaa/enrutar | GET | Test document routing |
| /siaa/fragmento | GET | View extracted fragment |
| /siaa/recargar | GET | Reload documents from disk |
Next Steps
- Learn about the document routing algorithm
- Understand chunking strategies
- Explore Ollama integration