Overview
Sentinel AI uses a Retrieval-Augmented Generation (RAG) pipeline to query technical documentation during the diagnosis phase. The system combines:
- Query rewriting for multi-perspective search
- Vector similarity search via Pinecone
- Cohere reranking for relevance optimization
- LLM synthesis for coherent answers
Architecture
The RAG system is implemented in src/core/knowledge.py as the VectorKnowledgeBase class:
Query Pipeline
```python
def _rewrite_query(self, query_text: str, llm) -> list:
    rewrite_prompt = (
        "You are a search query optimizer for technical documentation "
        "about PostgreSQL, Docker, and Nginx.\n"
        "The documents are written in BOTH English and Spanish.\n"
        "Given the user's question, generate exactly 5 search queries.\n"
        "Rules:\n"
        "- Query 1: Rephrase in English using official doc terminology\n"
        "- Query 2: Rephrase in Spanish using technical terminology\n"
        "- Query 3: Use specific section names, table names, or chapter references\n"
        "- Query 4: List specific keywords/values the answer would contain\n"
        "- Query 5: Alternative interpretation — different but related question\n"
        "Return ONLY the 5 queries, one per line, no numbering.\n\n"
        f"User question: {query_text}"
    )
    response = llm.complete(rewrite_prompt)
    queries = [q.strip() for q in str(response).strip().split("\n") if q.strip()]
    queries.insert(0, query_text)  # Prepend original query
    return queries[:6]  # Original query plus up to 5 rewrites
```
Multi-language support: The system generates queries in both English and Spanish to match the bilingual documentation corpus.
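For illustration, here is how the rewrite parsing behaves with a stubbed LLM. FakeLLM and its canned response are hypothetical; the real method is `_rewrite_query` on VectorKnowledgeBase and calls the production LLM.

```python
class FakeLLM:
    """Hypothetical stand-in for the real LLM client."""
    def complete(self, prompt: str) -> str:
        # Canned response: 5 rewrites, one per line, as the prompt demands
        return (
            "PostgreSQL service not running troubleshooting\n"
            "Servicio PostgreSQL caído solución\n"
            "pg_ctl restart command\n"
            "postgresql.conf, pg_hba.conf, listen_addresses\n"
            "Database connection refused fix"
        )

def rewrite_query(query_text: str, llm) -> list:
    # Same parsing as _rewrite_query: split into non-empty lines,
    # prepend the original question, cap the list at 6 entries.
    response = llm.complete("(rewrite prompt omitted)")
    queries = [q.strip() for q in str(response).strip().split("\n") if q.strip()]
    queries.insert(0, query_text)
    return queries[:6]

queries = rewrite_query("Servicio 'postgresql' no esta activo", FakeLLM())
```

The original question always survives as `queries[0]`, so retrieval never depends solely on the LLM's rewrites.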
```python
search_queries = self._rewrite_query(query_text, self.llm)
retriever = self.index.as_retriever(similarity_top_k=5)

# Retrieve for every rewritten query; deduplicate chunks by node_id
all_nodes = []
seen_ids = set()
for sq in search_queries:
    nodes = retriever.retrieve(sq)
    for node in nodes:
        if node.node_id not in seen_ids:
            seen_ids.add(node.node_id)
            all_nodes.append(node)

# Rerank against the English rewrite (the first generated query)
english_query = search_queries[1] if len(search_queries) > 1 else query_text
reranked_nodes = self.reranker.postprocess_nodes(all_nodes, query_str=english_query)
```

Embeddings are generated with text-embedding-3-small (1536 dimensions).
Reranking reduces the candidate set from ~30 chunks to the 5 most relevant, significantly improving answer quality.
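To make the select-top-n step concrete, here is a toy sketch that substitutes a trivial word-overlap scorer for Cohere's neural reranker; the scorer and the sample chunks are invented for illustration only.

```python
def rerank(query: str, chunks: list, top_n: int = 5) -> list:
    """Score each chunk against the query and keep the top_n best.

    Word overlap is a toy stand-in for a neural reranker, which scores
    the (query, chunk) pair jointly instead of comparing embeddings.
    """
    q_words = set(query.lower().split())
    def score(chunk: str) -> int:
        return len(q_words & set(chunk.lower().split()))
    return sorted(chunks, key=score, reverse=True)[:top_n]

candidates = [
    "nginx reverse proxy configuration",
    "restart the postgresql service with pg_ctl",
    "docker compose volumes",
    "postgresql service fails to start: common errors",
    "tuning shared_buffers",
    "checking postgresql service status",
]
top = rerank("postgresql service not running", candidates, top_n=3)
# Only the PostgreSQL-service chunks survive the cut
```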
```python
context_str = "\n\n---\n\n".join([node.get_content() for node in reranked_nodes])

# Synthesis prompt (in Spanish). It instructs the model to: always answer in
# Spanish, use only the provided context, never invent information, reproduce
# tables faithfully, combine information spread across fragments, and fall back
# to a fixed "no information found" sentence when the context is insufficient.
qa_template_str = (
    "Eres un asistente tecnico experto. "
    "Se te proporcionan fragmentos de documentacion tecnica oficial.\n"
    "Analiza TODOS los fragmentos de contexto y sintetiza una respuesta completa.\n"
    "---------------------\n"
    f"{context_str}\n"
    "---------------------\n"
    f"Pregunta: {query_text}\n\n"
    "REGLAS ESTRICTAS:\n"
    "1. Responde SIEMPRE en espanol.\n"
    "2. Usa UNICAMENTE la informacion presente en el contexto anterior.\n"
    "3. NUNCA inventes, supongas ni agregues informacion que NO este en el contexto.\n"
    "4. Si el contexto contiene tablas, reproducelas completas y fielmente.\n"
    "5. Si la informacion esta repartida en varios fragmentos, combinalos.\n"
    "6. Si el contexto NO contiene informacion suficiente, di exactamente:\n"
    "   'No encontre informacion especifica sobre eso en los documentos cargados.'\n"
)
response = self.llm.complete(qa_template_str)
final_response = str(response)
```
Hallucination prevention: The prompt explicitly instructs the LLM to only use information from the provided context, never inventing details.
Streaming Interface
For the chat UI, the system provides a streaming version that yields incremental results; the implementation lives at src/core/knowledge.py:172-235.
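The streaming code is not reproduced on this page. A minimal sketch of the generator pattern such an interface typically follows, with retrieval and LLM streaming injected as stubs (the function name, event shapes, and stubs are all assumptions, not the actual implementation):

```python
from typing import Callable, Iterator

def query_stream(
    query_text: str,
    retrieve: Callable[[str], list],
    stream_complete: Callable[[str], Iterator[str]],
) -> Iterator[dict]:
    """Yield status events first, then incremental answer tokens."""
    yield {"type": "status", "message": "Retrieving documents..."}
    chunks = retrieve(query_text)

    yield {"type": "status", "message": "Synthesizing answer..."}
    prompt = "\n\n---\n\n".join(chunks) + f"\n\nPregunta: {query_text}"
    for token in stream_complete(prompt):
        yield {"type": "token", "text": token}

# Usage with stubbed retrieval and LLM streaming:
events = list(query_stream(
    "postgresql down",
    retrieve=lambda q: ["chunk about pg_ctl"],
    stream_complete=lambda p: iter(["Para ", "reiniciar..."]),
))
```

The chat UI can render status events as progress indicators and append token events to the answer as they arrive.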
Configuration
RAG system parameters are defined in src/core/config.py:
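The exact contents of src/core/config.py are not shown on this page. A hedged sketch of the parameters the pipeline above implies (all constant names are assumptions; the values follow from the code and metrics in this document):

```python
# src/core/config.py (illustrative sketch; constant names are assumptions)

EMBED_MODEL = "text-embedding-3-small"   # 1536-dimension embeddings
SIMILARITY_TOP_K = 5                     # chunks retrieved per rewritten query
NUM_REWRITES = 5                         # rewrites generated per question
RERANK_TOP_N = 5                         # chunks kept after Cohere reranking
LLM_TEMPERATURE = 0.0                    # deterministic synthesis
MAX_QUERIES = NUM_REWRITES + 1           # rewrites plus the original query
```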
Query Example
When the diagnose node queries the knowledge base with "Servicio 'postgresql' no esta activo" ("The 'postgresql' service is not active"):
- Rewritten queries:
  - "PostgreSQL service not running troubleshooting"
  - "Servicio PostgreSQL caído solución"
  - "pg_ctl restart command"
  - "postgresql.conf, pg_hba.conf, listen_addresses"
  - "Database connection refused fix"
- Retrieved chunks: 30 candidates from manuals
- Reranked chunks: Top 5 most relevant (e.g., PostgreSQL service management, pg_ctl commands, common errors)
- Synthesized response: "Para reiniciar PostgreSQL, usa sudo service postgresql restart. Si el problema persiste, verifica los logs en /var/log/postgresql/..." ("To restart PostgreSQL, use sudo service postgresql restart. If the problem persists, check the logs in /var/log/postgresql/...")
Performance Characteristics
Query Latency
- Query rewriting: ~1-2s
- Vector retrieval: ~500ms
- Reranking: ~300ms
- LLM synthesis: ~3-5s
- Total: ~5-8 seconds
Retrieval Metrics
- Candidates retrieved: ~30 chunks
- Final context: 5 chunks
- Avg context size: 2-3k tokens
- Deduplication rate: ~20%
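The deduplication rate above follows directly from the retrieval counts. A quick sketch of the calculation (the sample node IDs are made up; 6 queries at 5 hits each give 30 raw hits, of which about 24 are unique):

```python
def dedup_rate(node_ids: list) -> float:
    """Fraction of retrieved hits discarded as duplicates."""
    total = len(node_ids)
    unique = len(set(node_ids))
    return (total - unique) / total if total else 0.0

# 24 unique chunk IDs, 6 of which were also hit by a second query: 30 raw hits
hits = [f"n{i}" for i in range(24)] + [f"n{i}" for i in range(6)]
rate = dedup_rate(hits)  # 6 duplicates out of 30 hits -> 0.2
```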
Key Design Decisions
Why query rewriting?
Multiple perspectives increase recall. A single query may miss relevant documents due to vocabulary mismatch (e.g., “service down” vs “daemon stopped” vs “proceso caído”).
Why Cohere reranking?
Vector similarity alone produces false positives. Neural reranking uses cross-attention between query and document to better assess relevance, improving precision by 30-40%.
Why bilingual prompts?
The documentation corpus contains both English (official docs) and Spanish (translated/local docs). Bilingual search ensures comprehensive coverage.
Why temperature=0?
Deterministic responses are critical for DevOps. The same error should always produce the same diagnosis, enabling reproducibility and debugging.
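Why temperature=0 yields determinism can be seen in a toy decoding sketch: temperature divides the logits before softmax, and as it approaches zero the distribution collapses onto the argmax token, so the same input always produces the same output. The token logit values below are invented for illustration.

```python
import math

def sample_greedy(logits: dict) -> str:
    """temperature=0 collapses sampling to argmax over logits."""
    return max(logits, key=logits.get)

def softmax(logits: dict, temperature: float) -> dict:
    """Convert temperature-scaled logits into a probability distribution."""
    scaled = {t: l / temperature for t, l in logits.items()}
    z = sum(math.exp(v) for v in scaled.values())
    return {t: math.exp(v) / z for t, v in scaled.items()}

logits = {"restart": 2.1, "reinstall": 1.3, "reboot": 0.9}
# Lower temperature concentrates probability mass on the top token;
# in the zero-temperature limit, greedy argmax is all that remains.
```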
Related Documentation
- Knowledge Base - How documentation is indexed and stored
- Agent Workflow - How the diagnose node uses RAG
- Security - Query sanitization and response validation