
Overview

Sentinel AI uses a sophisticated Retrieval-Augmented Generation (RAG) pipeline to query technical documentation during the diagnosis phase. The system combines:
  • Query rewriting for multi-perspective search
  • Vector similarity search via Pinecone
  • Cohere reranking for relevance optimization
  • LLM synthesis for coherent answers
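
The four stages can be sketched as a simple composition. All helper names below are illustrative stand-ins, not the actual Sentinel AI API:

```python
# Illustrative sketch of the four-stage RAG pipeline described above.
# Every function here is a hypothetical stand-in for the real component.

def rewrite_query(query: str) -> list[str]:
    # Stage 1: expand the user question into multiple search queries.
    return [query, f"{query} (rephrased)"]

def retrieve(queries: list[str]) -> list[str]:
    # Stage 2: vector-similarity search per query, deduplicated.
    seen, chunks = set(), []
    for q in queries:
        for chunk in (f"chunk-for:{q}",):  # stand-in for Pinecone results
            if chunk not in seen:
                seen.add(chunk)
                chunks.append(chunk)
    return chunks

def rerank(query: str, chunks: list[str], top_n: int = 5) -> list[str]:
    # Stage 3: keep only the most relevant chunks (Cohere in the real system).
    return chunks[:top_n]

def synthesize(query: str, chunks: list[str]) -> str:
    # Stage 4: LLM answer grounded in the reranked context.
    return f"Answer to {query!r} using {len(chunks)} chunks"

def rag_query(query: str) -> str:
    queries = rewrite_query(query)
    candidates = retrieve(queries)
    context = rerank(queries[0], candidates)
    return synthesize(query, context)
```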

Architecture

The RAG system is implemented in src/core/knowledge.py as the VectorKnowledgeBase class:
class VectorKnowledgeBase:
    def __init__(self):
        self.pc = Pinecone(api_key=config.PINECONE_API_KEY)
        self.index_name = config.PINECONE_INDEX_NAME
        self.vector_store = PineconeVectorStore(
            pinecone_index=self.pc.Index(self.index_name)
        )
        self.embed_model = OpenAIEmbedding(model=config.EMBEDDING_MODEL)
        self.llm = OpenAI(model=config.MODEL_NAME, temperature=config.TEMPERATURE)
        self.reranker = CohereRerank(
            api_key=config.COHERE_API_KEY,
            top_n=5
        )
        self.index = None

Query Pipeline

Step 1: Query Rewriting

The system generates 5 alternative search queries (the original is prepended, for up to 6 total) to maximize retrieval coverage.

Implementation (src/core/knowledge.py:98-115):
def _rewrite_query(self, query_text: str, llm) -> list:
    rewrite_prompt = (
        "You are a search query optimizer for technical documentation "
        "about PostgreSQL, Docker, and Nginx.\n"
        "The documents are written in BOTH English and Spanish.\n"
        "Given the user's question, generate exactly 5 search queries.\n"
        "Rules:\n"
        "- Query 1: Rephrase in English using official doc terminology\n"
        "- Query 2: Rephrase in Spanish using technical terminology\n"
        "- Query 3: Use specific section names, table names, or chapter references\n"
        "- Query 4: List specific keywords/values the answer would contain\n"
        "- Query 5: Alternative interpretation — different but related question\n"
        "Return ONLY the 5 queries, one per line, no numbering.\n\n"
        f"User question: {query_text}"
    )
    response = llm.complete(rewrite_prompt)
    queries = [q.strip() for q in str(response).strip().split("\n") if q.strip()]
    queries.insert(0, query_text)  # Prepend original query
    return queries[:6]
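
The newline split above assumes the model obeys the "no numbering" rule; in practice models sometimes number their output anyway. A defensive normalizer (not part of the source, purely a sketch) can strip stray list markers before prepending the original query:

```python
import re

def normalize_queries(raw_response: str, original: str, limit: int = 6) -> list[str]:
    # Strip leading list markers ("1.", "2)", "-", "*", "•") the model may
    # emit despite instructions, drop blanks, then prepend the original query.
    queries = []
    for line in raw_response.strip().split("\n"):
        cleaned = re.sub(r"^\s*(?:\d+[.)]|[-*•])\s*", "", line).strip()
        if cleaned:
            queries.append(cleaned)
    return ([original] + queries)[:limit]
```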
Multi-language support: The system generates queries in both English and Spanish to match the bilingual documentation corpus.

Step 2: Vector Retrieval

Each rewritten query retrieves the top 5 most similar document chunks.

Implementation (src/core/knowledge.py:124-134):
search_queries = self._rewrite_query(query_text, self.llm)

retriever = self.index.as_retriever(similarity_top_k=5)
all_nodes = []
seen_ids = set()

for sq in search_queries:
    nodes = retriever.retrieve(sq)
    for node in nodes:
        if node.node_id not in seen_ids:
            seen_ids.add(node.node_id)
            all_nodes.append(node)
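
The first-wins deduplication above can be exercised in isolation. The `Node` dataclass below is a minimal stand-in for a LlamaIndex node (only `node_id` matters here):

```python
from dataclasses import dataclass

@dataclass
class Node:
    # Minimal stand-in for a retrieved node; only node_id matters here.
    node_id: str

def dedupe_nodes(batches: list[list[Node]]) -> list[Node]:
    # Same first-wins deduplication as the retrieval loop above:
    # a node seen in an earlier query's results is kept, later copies dropped.
    seen_ids, all_nodes = set(), []
    for nodes in batches:
        for node in nodes:
            if node.node_id not in seen_ids:
                seen_ids.add(node.node_id)
                all_nodes.append(node)
    return all_nodes
```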
Retrieval Settings:
  • Embedding model: text-embedding-3-small (1536 dimensions)
  • Top-k per query: 5 chunks
  • Deduplication: Nodes are deduplicated by node_id
  • Total candidates: Up to 30 chunks (6 queries × 5 results)
Step 3: Cohere Reranking

The retrieved chunks are reranked using Cohere's neural reranker to improve precision.

Implementation (src/core/knowledge.py:136-137):

english_query = search_queries[1] if len(search_queries) > 1 else query_text
reranked_nodes = self.reranker.postprocess_nodes(all_nodes, query_str=english_query)

Reranking Configuration:

  • Model: Cohere Rerank API
  • Top-n: 5 (returns the 5 most relevant chunks)
  • Query language: English (first rewritten query)

Reranking reduces the candidate set of up to 30 chunks to the 5 most relevant, significantly improving answer quality.
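
Because reranking calls an external API, a common hardening pattern is to fall back to the retriever's own similarity order if the call fails, so the pipeline still returns context. This wrapper is a sketch, not part of the source:

```python
def rerank_with_fallback(reranker, nodes, query: str, top_n: int = 5):
    # Try the external reranker; on any API failure, fall back to the
    # first top_n candidates in retrieval order so a diagnosis can proceed.
    try:
        return reranker.postprocess_nodes(nodes, query_str=query)
    except Exception:
        return nodes[:top_n]
```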
Step 4: LLM Synthesis

The reranked chunks are concatenated and passed to an LLM for synthesis.

Implementation (src/core/knowledge.py:139-170):
context_str = "\n\n---\n\n".join([node.get_content() for node in reranked_nodes])

qa_template_str = (
    "Eres un asistente tecnico experto. "
    "Se te proporcionan fragmentos de documentacion tecnica oficial.\n"
    "Analiza TODOS los fragmentos de contexto y sintetiza una respuesta completa.\n"
    "---------------------\n"
    f"{context_str}\n"
    "---------------------\n"
    f"Pregunta: {query_text}\n\n"
    "REGLAS ESTRICTAS:\n"
    "1. Responde SIEMPRE en espanol.\n"
    "2. Usa UNICAMENTE la informacion presente en el contexto anterior.\n"
    "3. NUNCA inventes, supongas ni agregues informacion que NO este en el contexto.\n"
    "4. Si el contexto contiene tablas, reproducelas completas y fielmente.\n"
    "5. Si la informacion esta repartida en varios fragmentos, combinalos.\n"
    "6. Si el contexto NO contiene informacion suficiente, di exactamente:\n"
    "   'No encontre informacion especifica sobre eso en los documentos cargados.'\n"
)

response = self.llm.complete(qa_template_str)
final_response = str(response)

Hallucination prevention: The prompt's strict rules (answer only in Spanish, use only the provided context, never invent or assume details, and emit a fixed "no information found" sentence when the context is insufficient) explicitly constrain the LLM to the retrieved documentation.
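
The context string is built by joining chunks with a `---` separator. A simple character budget (a rough proxy for tokens) can guard against overlong prompts; this helper is a sketch under that assumption, not the shipped code:

```python
def build_context(chunks: list[str], max_chars: int = 12000) -> str:
    # Join chunks with the same "---" separator used above, stopping
    # before the rough character budget is exceeded. The first chunk is
    # always kept so the prompt never ends up with empty context.
    parts, total = [], 0
    for chunk in chunks:
        if total + len(chunk) > max_chars and parts:
            break
        parts.append(chunk)
        total += len(chunk)
    return "\n\n---\n\n".join(parts)
```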
Step 5: Source Attribution

The system appends source file names to the response for transparency.

Implementation (src/core/knowledge.py:161-168):

sources = []
for node in reranked_nodes:
    file_name = node.metadata.get("file_name", "Archivo desconocido")
    sources.append(file_name)

if sources:
    unique_sources = list(set(sources))
    final_response += "\n\n**Fuentes:**\n- " + "\n- ".join(unique_sources)
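
One caveat worth noting: `list(set(sources))` does not preserve order, so the listed sources can vary between runs. If stable, relevance-ordered attribution matters, `dict.fromkeys` dedupes while keeping first-seen order. This is an alternative sketch, not the shipped code:

```python
def unique_sources_ordered(sources: list[str]) -> list[str]:
    # dict preserves insertion order (Python 3.7+), so this dedupes while
    # keeping the reranked relevance order in which sources first appeared.
    return list(dict.fromkeys(sources))
```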

Streaming Interface

For the chat UI, the system provides a streaming version that yields incremental results. Implementation (src/core/knowledge.py:172-235):

def stream_query(self, query_text: str):
    """Generates a stream of events and content for the chat UI."""
    yield {"event": "thinking", "data": "Analizando tu pregunta..."}

    yield {"event": "thinking", "data": "Optimizando búsqueda..."}
    search_queries = self._rewrite_query(query_text, self.llm)

    yield {"event": "thinking", "data": "Consultando base de conocimiento vectorizada..."}
    retriever = self.index.as_retriever(similarity_top_k=5)
    # ... retrieval logic ...

    yield {"event": "thinking", "data": f"Leyendo {len(reranked_nodes)} fragmentos relevantes..."}

    # Stream LLM response
    response_gen = self.llm.stream_complete(qa_template_str)
    for delta in response_gen:
        yield {"event": "message", "data": delta.delta}

    yield {"event": "done", "data": ""}
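
A consumer of this event stream separates the status ("thinking") events from the answer deltas. The stub generator below mimics the event shape; it stands in for the real `stream_query` method:

```python
def fake_stream():
    # Stub generator mimicking the stream_query event shape above
    # (messages translated to English for illustration).
    yield {"event": "thinking", "data": "Analyzing your question..."}
    yield {"event": "message", "data": "Hello "}
    yield {"event": "message", "data": "world"}
    yield {"event": "done", "data": ""}

def consume(stream):
    # Accumulate message deltas into the final answer; collect thinking
    # events as status updates; stop on the terminal "done" event.
    answer, statuses = [], []
    for event in stream:
        if event["event"] == "message":
            answer.append(event["data"])
        elif event["event"] == "thinking":
            statuses.append(event["data"])
        elif event["event"] == "done":
            break
    return "".join(answer), statuses
```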

Configuration

RAG system parameters are defined in src/core/config.py:

class Config:
    OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
    MODEL_NAME = "gpt-4o"
    TEMPERATURE = 0

    PINECONE_API_KEY = os.getenv("PINECONE_API_KEY")
    PINECONE_INDEX_NAME = "sentinel-ai-index"

    LLAMA_CLOUD_API_KEY = os.getenv("LLAMA_CLOUD_API_KEY")
    EMBEDDING_MODEL = "text-embedding-3-small"
    EMBEDDING_DIM = 1536

    COHERE_API_KEY = os.getenv("COHERE_API_KEY")
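
Since every key is read from the environment, a startup check that fails fast on missing variables avoids opaque API errors deep in the pipeline. A minimal sketch (the variable list matches the config above; the helper itself is hypothetical):

```python
import os

# API keys the RAG pipeline cannot run without.
REQUIRED_ENV = ["OPENAI_API_KEY", "PINECONE_API_KEY", "COHERE_API_KEY"]

def validate_env(env=os.environ) -> list[str]:
    # Return the names of any required variables that are missing or empty,
    # so callers can raise one clear error at startup.
    return [name for name in REQUIRED_ENV if not env.get(name)]
```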

Query Example

When the diagnose node queries the knowledge base:

# From src/agent/nodes/diagnose.py:40-42
rag_context = ""
if kb:
    rag_context = kb.query(f"How to fix: {error}")

Example flow for the error "Servicio 'postgresql' no esta activo" ("PostgreSQL service is not active"):

1. Rewritten queries:
  • "PostgreSQL service not running troubleshooting"
  • "Servicio PostgreSQL caído solución"
  • "pg_ctl restart command"
  • "postgresql.conf, pg_hba.conf, listen_addresses"
  • "Database connection refused fix"
2. Retrieved chunks: up to 30 candidates from the manuals
3. Reranked chunks: top 5 most relevant (e.g., PostgreSQL service management, pg_ctl commands, common errors)
4. Synthesized response (in Spanish, per the prompt rules): "Para reiniciar PostgreSQL, usa sudo service postgresql restart. Si el problema persiste, verifica los logs en /var/log/postgresql/..."

Performance Characteristics

Query Latency

  • Query rewriting: ~1-2s
  • Vector retrieval: ~500ms
  • Reranking: ~300ms
  • LLM synthesis: ~3-5s
  • Total: ~5-8 seconds

Retrieval Metrics

  • Candidates retrieved: ~30 chunks
  • Final context: 5 chunks
  • Avg context size: 2-3k tokens
  • Deduplication rate: ~20%

Key Design Decisions

  • Multiple perspectives increase recall. A single query may miss relevant documents due to vocabulary mismatch (e.g., "service down" vs "daemon stopped" vs "proceso caído").
  • Vector similarity alone produces false positives. Neural reranking uses cross-attention between query and document to better assess relevance, improving precision by 30-40%.
  • Bilingual search ensures comprehensive coverage. The documentation corpus contains both English (official docs) and Spanish (translated/local docs).
  • Deterministic responses are critical for DevOps. The same error should always produce the same diagnosis, enabling reproducibility and debugging.
