Overview
The /siaa/chat endpoint is the primary interface for interacting with SIAA. It processes both conversational queries (greetings, general questions) and document-based queries (procedural, regulatory questions), returning responses via Server-Sent Events (SSE) streaming.
Endpoint
Request
Headers
- Content-Type: must be application/json
Body Parameters
Example Request Body
Response
The endpoint returns a Server-Sent Events (SSE) stream with Content-Type: text/event-stream.
Response Headers
- Content-Type: text/event-stream
- Cache-Control: no-cache
- X-Accel-Buffering: no (prevents proxy buffering)
- X-Cache: present with value HIT when the response is served from cache
Stream Format
Each chunk in the stream follows this format:
Example Response Stream
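The exact chunk schema is not reproduced here. As a minimal sketch, assuming chunks arrive as standard SSE `data:` lines (the payload below is illustrative, not captured from the real endpoint), a client can reassemble the streamed text like this:

```python
def parse_sse(raw: str) -> list[str]:
    """Extract the payload of each `data:` line from a raw SSE stream.

    Assumes one payload per `data:` line, per the SSE wire format;
    the actual SIAA chunk schema may differ.
    """
    chunks = []
    for line in raw.splitlines():
        if line.startswith("data:"):
            chunks.append(line[len("data:"):].strip())
    return chunks

# Illustrative stream (not real SIAA output):
stream = "data: Hola,\n\ndata: ¿en qué puedo ayudarle?\n\n"
print("".join(parse_sse(stream)))
```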
Query Types
Conversational Queries
Short phrases like greetings, thanks, or questions about SIAA itself are handled conversationally without document search:
- “Hola”, “Buenos días”
- “¿Qué es SIAA?”
- “Gracias”, “Adiós”
Document Queries
Questions about judicial procedures, regulations, or administrative processes trigger document retrieval and RAG-based responses:
- Questions containing judicial/technical terms (SIERJU, PSAA, acuerdo, juzgado, etc.)
- Questions longer than 8 characters that aren’t conversational
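The classification rules above can be sketched as a small function. The term and phrase lists are illustrative assumptions drawn from the examples in this document, not SIAA's actual configuration:

```python
# Illustrative lists, assembled from the examples above; the real
# system's vocabulary is certainly larger.
CONVERSATIONAL = {"hola", "buenos días", "gracias", "adiós", "¿qué es siaa?"}
JUDICIAL_TERMS = {"sierju", "psaa", "acuerdo", "juzgado"}

def classify(query: str) -> str:
    """Rough reconstruction of the routing heuristic described above."""
    q = query.strip().lower()
    if q in CONVERSATIONAL:
        return "conversational"
    if any(term in q for term in JUDICIAL_TERMS):
        return "document"
    if len(q) > 8:
        return "document"
    return "conversational"
```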
Clarification Responses
When a query is ambiguous (e.g., “juzgado civil” without specifying municipal or circuito), SIAA responds with clarification options:
Cache Behavior
Cache Hit: If the same question has been asked recently (within 1 hour), the cached response is returned with the header X-Cache: HIT. Cache keys are normalized (case-insensitive, accent-insensitive, punctuation removed).
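The key normalization can be sketched with the standard library; this is an assumption about the mechanics (Unicode decomposition plus punctuation stripping), not SIAA's exact code:

```python
import string
import unicodedata

def cache_key(question: str) -> str:
    """Normalize a question into a cache key: lowercase,
    accent-insensitive, punctuation removed. A sketch of the
    normalization described above; exact rules may differ."""
    decomposed = unicodedata.normalize("NFD", question.lower())
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    no_punct = no_accents.translate(
        str.maketrans("", "", string.punctuation + "¿¡"))
    return " ".join(no_punct.split())

print(cache_key("¿Qué es el SIERJU?"))  # -> "que es el sierju"
```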
Cache Miss: New queries trigger:
- Document routing (TF-IDF + density + filename matching)
- Chunk extraction (sliding window with overlap)
- LLM inference (Ollama)
- Cache storage (if successful)
- TTL: 3600 seconds (1 hour)
- LRU eviction when cache reaches 200 entries
- Cleared on document reload
- Negative responses (“no encontré”) are NOT cached
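The cache behavior above (1-hour TTL, 200-entry LRU cap, negative responses skipped) can be sketched with an `OrderedDict`. This is a minimal model of the described behavior, not the actual implementation:

```python
import time
from collections import OrderedDict

class ResponseCache:
    """LRU cache with TTL, mirroring the behavior described above
    (1-hour TTL, 200-entry cap). A sketch, not SIAA's code."""

    def __init__(self, max_entries: int = 200, ttl: float = 3600.0):
        self.max_entries = max_entries
        self.ttl = ttl
        self._data: OrderedDict[str, tuple[float, str]] = OrderedDict()

    def get(self, key: str):
        entry = self._data.get(key)
        if entry is None:
            return None
        stored_at, response = entry
        if time.monotonic() - stored_at > self.ttl:
            del self._data[key]          # expired
            return None
        self._data.move_to_end(key)      # mark as recently used
        return response

    def put(self, key: str, response: str):
        if "no encontré" in response:    # negative responses are not cached
            return
        self._data[key] = (time.monotonic(), response)
        self._data.move_to_end(key)
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)   # evict least-recently-used

    def clear(self):                     # called on document reload
        self._data.clear()
```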
Error Responses
Error messages are streamed as SSE events:
- ⏳ Sistema ocupado. Intente en 30 segundos. - Max concurrent requests reached
- ⚠ IA no responde. Intente de nuevo. - Ollama connection timeout
- ⏱ Consulta tomó demasiado tiempo. - Response generation timeout (180s)
- ⚠ Servidor IA reiniciándose. Espere 1 minuto. - Ollama service unavailable
Examples
Example 1: Document Query
Example 2: Conversational Query
Example 3: Multi-turn Conversation
Implementation Notes
Document Routing
The system uses a multi-level routing algorithm:
- TF-IDF keywords (auto-generated + manual)
- Density index (term frequency normalized by document)
- Filename matching (pattern detection for PSAA, PCSJA, acuerdo, etc.)
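Combining the three signals might look like the toy scorer below. The weights and the `doc` fields (`keywords`, `term_density`, `filename`) are illustrative assumptions, not SIAA's actual values:

```python
import re

def route_score(query: str, doc: dict) -> float:
    """Toy multi-level score combining keyword hits, density, and
    filename patterns, as described above. Weights are invented."""
    terms = set(re.findall(r"\w+", query.lower()))
    keyword_hits = len(terms & set(doc["keywords"]))
    density = sum(doc["term_density"].get(t, 0.0) for t in terms)
    filename_bonus = 1.0 if any(p in doc["filename"].lower()
                                for p in ("psaa", "pcsja", "acuerdo")) else 0.0
    return 2.0 * keyword_hits + density + filename_bonus
```

The best-matching document would then simply be the one with the highest score over the corpus.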
Chunk Selection Strategy
Depending on query confidence, the system selects chunks dynamically:
- Francotirador (ratio ≥3.0): 1 chunk (~800 chars) - high confidence
- Binóculo (ratio ≥1.8): 2 chunks (~1600 chars) - medium confidence
- Escopeta (ratio <1.8): 3 chunks (~2400 chars) - low confidence
- Listado (enumeration queries): Minimum 2 chunks regardless of ratio
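The tiering above reduces to a small mapping; a sketch of the described thresholds, assuming an `is_listing` flag marks enumeration queries:

```python
def select_chunk_count(ratio: float, is_listing: bool) -> int:
    """Map routing-confidence ratio to chunk count per the
    francotirador/binóculo/escopeta tiers described above."""
    if ratio >= 3.0:
        n = 1      # francotirador: high confidence
    elif ratio >= 1.8:
        n = 2      # binóculo: medium confidence
    else:
        n = 3      # escopeta: low confidence
    if is_listing:
        n = max(n, 2)   # listado: enumeration queries get at least 2 chunks
    return n
```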
Context Window Management
num_ctx is dynamically adjusted based on context size:
- <400 tokens: num_ctx=1024
- 400-900 tokens: num_ctx=2048
- >900 tokens: num_ctx=3072
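As a sketch of those thresholds (the handling of exactly 400 and 900 tokens is an assumption):

```python
def pick_num_ctx(context_tokens: int) -> int:
    """Tier Ollama's num_ctx by context size, per the thresholds above."""
    if context_tokens < 400:
        return 1024
    if context_tokens <= 900:
        return 2048
    return 3072
```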
Quality Monitoring
All queries are logged to /opt/siaa/logs/calidad.jsonl with:
- Timestamp, query type, question, response preview
- Documents used, context size, response time
- Automatic alert detection (POSIBLE_ALUCINACION, SIN_CONTEXTO)
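Appending such records in JSONL form is one JSON object per line; the field names in the example record are illustrative, not SIAA's actual schema:

```python
import json

def log_quality(path: str, record: dict) -> None:
    """Append one JSON object per line (JSONL), as used by calidad.jsonl.
    Field names in callers' records are up to the application."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```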
Rate Limiting
The system enforces:
- Max concurrent Ollama requests: 2 (configurable via MAX_OLLAMA_SIMULTANEOS)
- Connection timeout: 8 seconds
- Response timeout: 180 seconds
Requests that exceed the concurrency limit fail with a COLA_LLENA error.
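A concurrency cap like this is commonly enforced with a semaphore; a minimal sketch (not SIAA's implementation) that rejects excess work with COLA_LLENA:

```python
import threading

# Cap mirrors MAX_OLLAMA_SIMULTANEOS = 2 from the list above.
OLLAMA_SLOTS = threading.BoundedSemaphore(2)

def with_ollama_slot(fn):
    """Run fn() only if an Ollama slot is free; otherwise fail fast
    with COLA_LLENA instead of queueing. A sketch, not SIAA's code."""
    if not OLLAMA_SLOTS.acquire(blocking=False):
        raise RuntimeError("COLA_LLENA")
    try:
        return fn()
    finally:
        OLLAMA_SLOTS.release()
```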