
Overview

Sentinel AI uses a sophisticated Retrieval-Augmented Generation (RAG) pipeline to query technical documentation during the diagnosis phase. The system combines:
  • Query rewriting for multi-perspective search
  • Vector similarity search via Pinecone
  • Cohere reranking for relevance optimization
  • LLM synthesis for coherent answers
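
The four stages can be sketched as a simple composition. All helper names below are illustrative stand-ins, not the actual Sentinel AI API:

```python
# Illustrative sketch of the four-stage RAG pipeline described above.
# Every function here is a hypothetical stand-in for the real component.

def rewrite_query(query: str) -> list[str]:
    # Stage 1: expand the user question into multiple search queries.
    return [query, f"{query} (rephrased)"]

def retrieve(queries: list[str]) -> list[str]:
    # Stage 2: vector-similarity search per query, deduplicated.
    seen, chunks = set(), []
    for q in queries:
        for chunk in (f"chunk-for:{q}",):  # stand-in for Pinecone results
            if chunk not in seen:
                seen.add(chunk)
                chunks.append(chunk)
    return chunks

def rerank(query: str, chunks: list[str], top_n: int = 5) -> list[str]:
    # Stage 3: keep only the most relevant chunks (Cohere in the real system).
    return chunks[:top_n]

def synthesize(query: str, chunks: list[str]) -> str:
    # Stage 4: LLM answer grounded in the reranked context.
    return f"Answer to {query!r} using {len(chunks)} chunks"

def rag_query(query: str) -> str:
    queries = rewrite_query(query)
    candidates = retrieve(queries)
    context = rerank(queries[0], candidates)
    return synthesize(query, context)
```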

Architecture

The RAG system is implemented in src/core/knowledge.py as the VectorKnowledgeBase class:
class VectorKnowledgeBase:
    def __init__(self):
        self.pc = Pinecone(api_key=config.PINECONE_API_KEY)
        self.index_name = config.PINECONE_INDEX_NAME
        self.vector_store = PineconeVectorStore(
            pinecone_index=self.pc.Index(self.index_name)
        )
        self.embed_model = OpenAIEmbedding(model=config.EMBEDDING_MODEL)
        self.llm = OpenAI(model=config.MODEL_NAME, temperature=config.TEMPERATURE)
        self.reranker = CohereRerank(
            api_key=config.COHERE_API_KEY,
            top_n=5
        )
        self.index = None

Query Pipeline

Step 1: Query Rewriting

The system generates 5 alternative search queries (the original is prepended, for up to 6 total) to maximize retrieval coverage.

Implementation (src/core/knowledge.py:98-115):
def _rewrite_query(self, query_text: str, llm) -> list:
    rewrite_prompt = (
        "You are a search query optimizer for technical documentation "
        "about PostgreSQL, Docker, and Nginx.\n"
        "The documents are written in BOTH English and Spanish.\n"
        "Given the user's question, generate exactly 5 search queries.\n"
        "Rules:\n"
        "- Query 1: Rephrase in English using official doc terminology\n"
        "- Query 2: Rephrase in Spanish using technical terminology\n"
        "- Query 3: Use specific section names, table names, or chapter references\n"
        "- Query 4: List specific keywords/values the answer would contain\n"
        "- Query 5: Alternative interpretation — different but related question\n"
        "Return ONLY the 5 queries, one per line, no numbering.\n\n"
        f"User question: {query_text}"
    )
    response = llm.complete(rewrite_prompt)
    queries = [q.strip() for q in str(response).strip().split("\n") if q.strip()]
    queries.insert(0, query_text)  # Prepend original query
    return queries[:6]
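
The newline split above assumes the model obeys the "no numbering" rule; in practice models sometimes number their output anyway. A defensive normalizer (not part of the source, purely a sketch) can strip stray list markers before prepending the original query:

```python
import re

def normalize_queries(raw_response: str, original: str, limit: int = 6) -> list[str]:
    # Strip leading list markers ("1.", "2)", "-", "*", "•") the model may
    # emit despite instructions, drop blanks, then prepend the original query.
    queries = []
    for line in raw_response.strip().split("\n"):
        cleaned = re.sub(r"^\s*(?:\d+[.)]|[-*•])\s*", "", line).strip()
        if cleaned:
            queries.append(cleaned)
    return ([original] + queries)[:limit]
```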
Multi-language support: The system generates queries in both English and Spanish to match the bilingual documentation corpus.

Step 2: Vector Retrieval

Each rewritten query retrieves the top 5 most similar document chunks.

Implementation (src/core/knowledge.py:124-134):
search_queries = self._rewrite_query(query_text, self.llm)

retriever = self.index.as_retriever(similarity_top_k=5)
all_nodes = []
seen_ids = set()

for sq in search_queries:
    nodes = retriever.retrieve(sq)
    for node in nodes:
        if node.node_id not in seen_ids:
            seen_ids.add(node.node_id)
            all_nodes.append(node)
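
The first-wins deduplication above can be exercised in isolation. The `Node` dataclass below is a minimal stand-in for a LlamaIndex node (only `node_id` matters here):

```python
from dataclasses import dataclass

@dataclass
class Node:
    # Minimal stand-in for a retrieved node; only node_id matters here.
    node_id: str

def dedupe_nodes(batches: list[list[Node]]) -> list[Node]:
    # Same first-wins deduplication as the retrieval loop above:
    # a node seen in an earlier query's results is kept, later copies dropped.
    seen_ids, all_nodes = set(), []
    for nodes in batches:
        for node in nodes:
            if node.node_id not in seen_ids:
                seen_ids.add(node.node_id)
                all_nodes.append(node)
    return all_nodes
```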
Retrieval Settings:
  • Embedding model: text-embedding-3-small (1536 dimensions)
  • Top-k per query: 5 chunks
  • Deduplication: Nodes are deduplicated by node_id
  • Total candidates: Up to 30 chunks (6 queries × 5 results)
Step 3: Cohere Reranking

The retrieved chunks are reranked using Cohere's neural reranker to improve precision.

Implementation (src/core/knowledge.py:136-137):

english_query = search_queries[1] if len(search_queries) > 1 else query_text
reranked_nodes = self.reranker.postprocess_nodes(all_nodes, query_str=english_query)

Reranking Configuration:

  • Model: Cohere Rerank API
  • Top-n: 5 (returns the 5 most relevant chunks)
  • Query language: English (first rewritten query)

Reranking reduces the candidate set of up to 30 chunks to the 5 most relevant, significantly improving answer quality.
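
Because reranking calls an external API, a common hardening pattern is to fall back to the retriever's own similarity order if the call fails, so the pipeline still returns context. This wrapper is a sketch, not part of the source:

```python
def rerank_with_fallback(reranker, nodes, query: str, top_n: int = 5):
    # Try the external reranker; on any API failure, fall back to the
    # first top_n candidates in retrieval order so a diagnosis can proceed.
    try:
        return reranker.postprocess_nodes(nodes, query_str=query)
    except Exception:
        return nodes[:top_n]
```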
Step 4: LLM Synthesis

The reranked chunks are concatenated and passed to an LLM for synthesis.

Implementation (src/core/knowledge.py:139-170):
context_str = "\n\n---\n\n".join([node.get_content() for node in reranked_nodes])

qa_template_str = (
    "Eres un asistente tecnico experto. "
    "Se te proporcionan fragmentos de documentacion tecnica oficial.\n"
    "Analiza TODOS los fragmentos de contexto y sintetiza una respuesta completa.\n"
    "---------------------\n"
    f"{context_str}\n"
    "---------------------\n"
    f"Pregunta: {query_text}\n\n"
    "REGLAS ESTRICTAS:\n"
    "1. Responde SIEMPRE en espanol.\n"
    "2. Usa UNICAMENTE la informacion presente en el contexto anterior.\n"
    "3. NUNCA inventes, supongas ni agregues informacion que NO este en el contexto.\n"
    "4. Si el contexto contiene tablas, reproducelas completas y fielmente.\n"
    "5. Si la informacion esta repartida en varios fragmentos, combinalos.\n"
    "6. Si el contexto NO contiene informacion suficiente, di exactamente:\n"
    "   'No encontre informacion especifica sobre eso en los documentos cargados.'\n"
)

response = self.llm.complete(qa_template_str)
final_response = str(response)

Hallucination prevention: The prompt's strict rules (answer only in Spanish, use only the provided context, never invent or assume details, and emit a fixed "no information found" sentence when the context is insufficient) explicitly constrain the LLM to the retrieved documentation.
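
The context string is built by joining chunks with a `---` separator. A simple character budget (a rough proxy for tokens) can guard against overlong prompts; this helper is a sketch under that assumption, not the shipped code:

```python
def build_context(chunks: list[str], max_chars: int = 12000) -> str:
    # Join chunks with the same "---" separator used above, stopping
    # before the rough character budget is exceeded. The first chunk is
    # always kept so the prompt never ends up with empty context.
    parts, total = [], 0
    for chunk in chunks:
        if total + len(chunk) > max_chars and parts:
            break
        parts.append(chunk)
        total += len(chunk)
    return "\n\n---\n\n".join(parts)
```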
Step 5: Source Attribution

The system appends source file names to the response for transparency.

Implementation (src/core/knowledge.py:161-168):

sources = []
for node in reranked_nodes:
    file_name = node.metadata.get("file_name", "Archivo desconocido")
    sources.append(file_name)

if sources:
    unique_sources = list(set(sources))
    final_response += "\n\n**Fuentes:**\n- " + "\n- ".join(unique_sources)
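
One caveat worth noting: `list(set(sources))` does not preserve order, so the listed sources can vary between runs. If stable, relevance-ordered attribution matters, `dict.fromkeys` dedupes while keeping first-seen order. This is an alternative sketch, not the shipped code:

```python
def unique_sources_ordered(sources: list[str]) -> list[str]:
    # dict preserves insertion order (Python 3.7+), so this dedupes while
    # keeping the reranked relevance order in which sources first appeared.
    return list(dict.fromkeys(sources))
```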

Streaming Interface

For the chat UI, the system provides a streaming version that yields incremental results. Implementation (src/core/knowledge.py:172-235):

def stream_query(self, query_text: str):
    """Generates a stream of events and content for the chat UI."""
    yield {"event": "thinking", "data": "Analizando tu pregunta..."}

    yield {"event": "thinking", "data": "Optimizando búsqueda..."}
    search_queries = self._rewrite_query(query_text, self.llm)

    yield {"event": "thinking", "data": "Consultando base de conocimiento vectorizada..."}
    retriever = self.index.as_retriever(similarity_top_k=5)
    # ... retrieval logic ...

    yield {"event": "thinking", "data": f"Leyendo {len(reranked_nodes)} fragmentos relevantes..."}

    # Stream LLM response
    response_gen = self.llm.stream_complete(qa_template_str)
    for delta in response_gen:
        yield {"event": "message", "data": delta.delta}

    yield {"event": "done", "data": ""}
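
A consumer of this event stream separates the status ("thinking") events from the answer deltas. The stub generator below mimics the event shape; it stands in for the real `stream_query` method:

```python
def fake_stream():
    # Stub generator mimicking the stream_query event shape above
    # (messages translated to English for illustration).
    yield {"event": "thinking", "data": "Analyzing your question..."}
    yield {"event": "message", "data": "Hello "}
    yield {"event": "message", "data": "world"}
    yield {"event": "done", "data": ""}

def consume(stream):
    # Accumulate message deltas into the final answer; collect thinking
    # events as status updates; stop on the terminal "done" event.
    answer, statuses = [], []
    for event in stream:
        if event["event"] == "message":
            answer.append(event["data"])
        elif event["event"] == "thinking":
            statuses.append(event["data"])
        elif event["event"] == "done":
            break
    return "".join(answer), statuses
```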

Configuration

RAG system parameters are defined in src/core/config.py:

class Config:
    OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
    MODEL_NAME = "gpt-4o"
    TEMPERATURE = 0

    PINECONE_API_KEY = os.getenv("PINECONE_API_KEY")
    PINECONE_INDEX_NAME = "sentinel-ai-index"

    LLAMA_CLOUD_API_KEY = os.getenv("LLAMA_CLOUD_API_KEY")
    EMBEDDING_MODEL = "text-embedding-3-small"
    EMBEDDING_DIM = 1536

    COHERE_API_KEY = os.getenv("COHERE_API_KEY")
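
Since every key is read from the environment, a startup check that fails fast on missing variables avoids opaque API errors deep in the pipeline. A minimal sketch (the variable list matches the config above; the helper itself is hypothetical):

```python
import os

# API keys the RAG pipeline cannot run without.
REQUIRED_ENV = ["OPENAI_API_KEY", "PINECONE_API_KEY", "COHERE_API_KEY"]

def validate_env(env=os.environ) -> list[str]:
    # Return the names of any required variables that are missing or empty,
    # so callers can raise one clear error at startup.
    return [name for name in REQUIRED_ENV if not env.get(name)]
```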

Query Example

When the diagnose node queries the knowledge base:

# From src/agent/nodes/diagnose.py:40-42
rag_context = ""
if kb:
    rag_context = kb.query(f"How to fix: {error}")

Example flow for the error "Servicio 'postgresql' no esta activo" ("PostgreSQL service is not active"):

1. Rewritten queries:
  • "PostgreSQL service not running troubleshooting"
  • "Servicio PostgreSQL caído solución"
  • "pg_ctl restart command"
  • "postgresql.conf, pg_hba.conf, listen_addresses"
  • "Database connection refused fix"
2. Retrieved chunks: up to 30 candidates from the manuals
3. Reranked chunks: top 5 most relevant (e.g., PostgreSQL service management, pg_ctl commands, common errors)
4. Synthesized response (in Spanish, per the prompt rules): "Para reiniciar PostgreSQL, usa sudo service postgresql restart. Si el problema persiste, verifica los logs en /var/log/postgresql/..."

Performance Characteristics

Query Latency

  • Query rewriting: ~1-2s
  • Vector retrieval: ~500ms
  • Reranking: ~300ms
  • LLM synthesis: ~3-5s
  • Total: ~5-8 seconds

Retrieval Metrics

  • Candidates retrieved: ~30 chunks
  • Final context: 5 chunks
  • Avg context size: 2-3k tokens
  • Deduplication rate: ~20%

Key Design Decisions

  • Multiple perspectives increase recall. A single query may miss relevant documents due to vocabulary mismatch (e.g., "service down" vs "daemon stopped" vs "proceso caído").
  • Vector similarity alone produces false positives. Neural reranking uses cross-attention between query and document to better assess relevance, improving precision by 30-40%.
  • Bilingual search ensures comprehensive coverage. The documentation corpus contains both English (official docs) and Spanish (translated/local docs).
  • Deterministic responses are critical for DevOps. The same error should always produce the same diagnosis, enabling reproducibility and debugging.
