Sentinel AI uses multiple AI models for reasoning, embeddings, and retrieval. This guide covers model configuration and customization.

Model Architecture

Sentinel AI’s model stack consists of:

  • Language Model - GPT-4o for reasoning, planning, and decision-making
  • Embedding Model - text-embedding-3-small for document vectorization
  • Vector Database - Pinecone for semantic search and knowledge retrieval
  • Reranker - Cohere Rerank for improving search relevance

Primary Language Model

The main reasoning engine uses OpenAI’s GPT-4o model:
import os

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
MODEL_NAME = "gpt-4o"
TEMPERATURE = 0
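
For reference, here is a minimal sketch of how these values could be wired into a LlamaIndex OpenAI client (hypothetical wiring; the real setup lives in src/core, and the import assumes the llama-index-llms-openai package):
import os
from llama_index.llms.openai import OpenAI

llm = OpenAI(
    model="gpt-4o",    # MODEL_NAME
    temperature=0,     # TEMPERATURE
    api_key=os.getenv("OPENAI_API_KEY"),
)
print(llm.complete("Reply with OK."))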

Model Parameters

MODEL_NAME
string
default:"gpt-4o"
The OpenAI model to use for reasoning and text generation. Supported models:
  • gpt-4o - Latest GPT-4 optimized model (recommended)
  • gpt-4-turbo - Fast GPT-4 variant
  • gpt-4 - Standard GPT-4
  • gpt-3.5-turbo - Faster, cheaper alternative
Defined in: src/core/config.py:11
TEMPERATURE
float
default:"0"
Sampling temperature for response generation.
  • 0 - Deterministic, consistent responses (recommended for DevOps)
  • 0.0-0.3 - Focused, predictable outputs
  • 0.4-0.7 - Balanced creativity and consistency
  • 0.8-1.0 - More creative, varied responses
Defined in: src/core/config.py:12
Temperature is set to 0 for deterministic DevOps operations. Increase for more creative problem-solving.
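
If you want to experiment with temperature without editing source, one option is to read it from an environment variable (a sketch; SENTINEL_TEMPERATURE is a hypothetical variable, and config.py currently hardcodes 0):
import os

TEMPERATURE = float(os.getenv("SENTINEL_TEMPERATURE", "0"))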

Customizing the Language Model

To use a different model, modify src/core/config.py:
class Config:
    MODEL_NAME = "gpt-4-turbo"
    TEMPERATURE = 0
Changing models may affect reasoning quality. Test thoroughly before deploying to production.

Embedding Model

Embeddings convert text into vector representations for semantic search:
EMBED_MODEL = "text-embedding-3-small"
EMBEDDING_MODEL = EMBED_MODEL
EMBEDDING_DIM = 1536
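
As a sanity check before ingesting, you can confirm that the model's output dimension matches EMBEDDING_DIM (a sketch assuming the llama-index-embeddings-openai package):
from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding(model="text-embedding-3-small")
vector = embed_model.get_text_embedding("restart postgresql")
assert len(vector) == 1536  # must equal EMBEDDING_DIM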

Embedding Parameters

EMBED_MODEL
string
default:"text-embedding-3-small"
OpenAI embedding model for document vectorization. Available models:
  • text-embedding-3-small - 1536 dimensions, fast and efficient (recommended)
  • text-embedding-3-large - 3072 dimensions, higher quality
  • text-embedding-ada-002 - Legacy model, 1536 dimensions
Defined in: src/core/config.py:19-20
EMBEDDING_DIM
integer
default:"1536"
Dimension size for embeddings. Must match the model’s output dimensions.
  • text-embedding-3-small: 1536 dims
  • text-embedding-3-large: 3072 dims
  • text-embedding-ada-002: 1536 dims
Defined in: src/core/config.py:21
If you change the embedding model, update EMBEDDING_DIM to match and recreate the Pinecone index.

Switching Embedding Models

To use a different embedding model:
1. Update configuration

Modify src/core/config.py:
EMBED_MODEL = "text-embedding-3-large"
EMBEDDING_MODEL = EMBED_MODEL
EMBEDDING_DIM = 3072  # Must match model dimensions
2. Delete existing Pinecone index

The index must be recreated with new dimensions:
from pinecone import Pinecone
from src.core.config import config

pc = Pinecone(api_key=config.PINECONE_API_KEY)
pc.delete_index(config.PINECONE_INDEX_NAME)
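
To confirm the index is gone before restarting, you can list the remaining indexes (Pinecone v3+ client):
print(pc.list_indexes().names())  # the old index should no longer appear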
3. Re-ingest documents

Restart Sentinel AI and re-ingest your documentation:
python main.py
# Use the ingest command to reload manuals

Vector Database (Pinecone)

Pinecone stores document embeddings for semantic search:
PINECONE_API_KEY = os.getenv("PINECONE_API_KEY")
PINECONE_INDEX = "sentinel-ai-index"
PINECONE_INDEX_NAME = PINECONE_INDEX
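
To inspect the index contents (vector count, dimension, namespaces), describe_index_stats is a quick check (sketch using the Pinecone v3+ client):
import os
from pinecone import Pinecone

pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))
index = pc.Index("sentinel-ai-index")
print(index.describe_index_stats())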

Pinecone Configuration

PINECONE_INDEX
string
default:"sentinel-ai-index"
Name of the Pinecone index for storing document vectors.
Defined in: src/core/config.py:15-16
Change this if you want to use a different index or environment.

Index Configuration

The Pinecone index is configured with:
  • Dimension: 1536 (matches text-embedding-3-small)
  • Metric: Cosine similarity
  • Cloud: AWS
  • Region: us-east-1
  • Type: Serverless (auto-scaling)
Modify src/core/knowledge.py:27-36 to change index settings:
self.pc.create_index(
    name=self.index_name,
    dimension=config.EMBEDDING_DIM,
    metric="cosine",  # or "euclidean", "dotproduct"
    spec=ServerlessSpec(
        cloud="aws",  # or "gcp", "azure"
        region="us-west-2"  # change region
    )
)

Reranker (Cohere)

Cohere Rerank improves search relevance by reranking retrieved documents:
COHERE_API_KEY = os.getenv("COHERE_API_KEY")

Reranker Configuration

top_n
integer
default:"5"
Number of documents to return after reranking.
Defined in: src/core/knowledge.py:23
  • Lower values (3-5): Faster, more focused results
  • Higher values (10-15): More context, slower processing

Customizing Reranker

Modify src/core/knowledge.py to adjust reranking behavior:
self.reranker = CohereRerank(
    api_key=config.COHERE_API_KEY,
    top_n=10  # Return top 10 after reranking
)

RAG Configuration

Retrieval-Augmented Generation (RAG) combines vector search with LLM reasoning:

Document Chunking

Documents are split into chunks for efficient retrieval:
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(
    chunk_size=1024,
    chunk_overlap=200
)
chunk_size
integer
default:"1024"
Maximum number of tokens per chunk.
  • Smaller chunks (512-1024): More precise, faster search
  • Larger chunks (2048-4096): More context, slower search
Defined in: src/core/knowledge.py:62
chunk_overlap
integer
default:"200"
Number of overlapping tokens between chunks. Prevents information loss at chunk boundaries.
Defined in: src/core/knowledge.py:62
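
To see how a document actually splits under these settings, a minimal sketch (the sample text is illustrative):
from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=200)
sample = Document(text="PostgreSQL restart procedure. " * 500)
nodes = splitter.get_nodes_from_documents([sample])
print(f"Split into {len(nodes)} chunks")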

Query Optimization

Sentinel AI uses query rewriting to improve search accuracy:
def _rewrite_query(self, query_text: str, llm) -> list:
    rewrite_prompt = (
        "You are a search query optimizer for technical documentation..."
        "Generate exactly 5 search queries..."
        "- Query 1: Rephrase in English using official doc terminology\n"
        "- Query 2: Rephrase in Spanish using technical terminology\n"
        "- Query 3: Use specific section names, table names, or chapter references\n"
        "- Query 4: List specific keywords/values the answer would contain\n"
        "- Query 5: Alternative interpretation\n"
        f"User question: {query_text}"
    )
    response = llm.complete(rewrite_prompt)
    queries = [q.strip() for q in str(response).strip().split("\n") if q.strip()]
    queries.insert(0, query_text)  # Include original query
    return queries[:6]
Query rewriting generates 5 variations of each question to improve recall. Results are combined and reranked for relevance.
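
Called from inside the knowledge base, the method returns the original question plus up to five rewrites, for example (illustrative):
queries = self._rewrite_query("How do I restart PostgreSQL?", self.llm)
# -> ["How do I restart PostgreSQL?", <English rewrite>, <Spanish rewrite>,
#     <section/table references>, <keyword list>, <alternative interpretation>]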

Retrieval Settings

# Retrieve top 5 documents per query
retriever = self.index.as_retriever(similarity_top_k=5)

# Execute multiple queries
for search_query in search_queries:
    nodes = retriever.retrieve(search_query)
    all_nodes.extend(nodes)

# Rerank combined results
reranked_nodes = self.reranker.postprocess_nodes(
    all_nodes, 
    query_str=english_query
)
similarity_top_k
integer
default:"5"
Number of documents to retrieve per query from Pinecone.
Defined in: src/core/knowledge.py:126
With 6 query variations × 5 docs each = up to 30 documents retrieved, then reranked to top 5.
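
Because the same chunk can match several query variations, deduplicating before the rerank call is a natural refinement (a sketch, not necessarily how knowledge.py handles it; node IDs are unique per chunk):
# Keep the highest-scoring copy of each retrieved chunk
unique = {}
for n in all_nodes:
    if n.node.node_id not in unique or (n.score or 0) > (unique[n.node.node_id].score or 0):
        unique[n.node.node_id] = n
reranked_nodes = self.reranker.postprocess_nodes(
    list(unique.values()),
    query_str=english_query
)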

LlamaParse Configuration

LlamaParse converts PDF documentation to markdown:
parser = LlamaParse(
    api_key=config.LLAMA_CLOUD_API_KEY,
    result_type="markdown",
    verbose=True,
    language="en",
)

Parser Parameters

  • result_type: Output format (markdown or text)
  • verbose: Enable detailed logging
  • language: Primary document language (en, es, etc.)
Modify src/core/knowledge.py to adjust parsing behavior:
parser = LlamaParse(
    api_key=config.LLAMA_CLOUD_API_KEY,
    result_type="markdown",
    verbose=False,  # Reduce logging
    language="es",  # Spanish documents
    parsing_instruction="Extract technical specifications and commands"
)
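
Ingestion then typically calls load_data with a file path (the path here is illustrative):
documents = parser.load_data("./manuals/example_manual.pdf")  # hypothetical path
print(f"Parsed {len(documents)} document(s)")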

Response Generation

The final response is generated using a structured prompt:
# The QA prompt is written in Spanish because Sentinel AI answers in Spanish.
# Rules, paraphrased in English: always answer in Spanish; use only the provided
# context; never invent or assume information; reproduce any tables in full;
# combine scattered information coherently; if the context is insufficient,
# reply exactly: 'No encontre informacion especifica sobre eso.'
qa_template_str = (
    "Eres un asistente tecnico experto. "
    "Se te proporcionan fragmentos de documentacion tecnica oficial.\n"
    "---------------------\n"
    f"{context_str}\n"
    "---------------------\n"
    f"Pregunta: {query_text}\n\n"
    "REGLAS ESTRICTAS:\n"
    "1. Responde SIEMPRE en espanol.\n"
    "2. Usa UNICAMENTE la informacion presente en el contexto.\n"
    "3. NUNCA inventes, supongas ni agregues informacion.\n"
    "4. Si el contexto contiene tablas, reproducelas completas.\n"
    "5. Si la informacion esta repartida, combinalos coherentemente.\n"
    "6. Si no hay informacion suficiente, di exactamente: "
    "'No encontre informacion especifica sobre eso.'"
)

response = self.llm.complete(qa_template_str)
The system prompt enforces strict adherence to source material, preventing hallucinations and ensuring accurate technical responses.

Performance Tuning

Optimize for Speed

# config.py
MODEL_NAME = "gpt-3.5-turbo"  # Faster model
TEMPERATURE = 0

# knowledge.py
SentenceSplitter(chunk_size=512, chunk_overlap=100)  # Smaller chunks
retriever = self.index.as_retriever(similarity_top_k=3)  # Fewer docs
CohereRerank(top_n=3)  # Fewer reranked results

Optimize for Quality

# config.py
MODEL_NAME = "gpt-4o"  # Best reasoning
TEMPERATURE = 0
EMBED_MODEL = "text-embedding-3-large"  # Better embeddings
EMBEDDING_DIM = 3072

# knowledge.py
SentenceSplitter(chunk_size=2048, chunk_overlap=400)  # Larger context
retriever = self.index.as_retriever(similarity_top_k=10)  # More docs
CohereRerank(top_n=10)  # More reranked results

Optimize for Cost

# config.py
MODEL_NAME = "gpt-3.5-turbo"  # Cheaper model
EMBED_MODEL = "text-embedding-3-small"  # Efficient embeddings

# knowledge.py
SentenceSplitter(chunk_size=1024, chunk_overlap=200)  # Balanced
retriever = self.index.as_retriever(similarity_top_k=5)
CohereRerank(top_n=5)

Monitoring and Debugging

Enable Verbose Logging

# In knowledge.py
import logging
logging.basicConfig(level=logging.DEBUG)

# LlamaParse verbose mode
parser = LlamaParse(
    api_key=config.LLAMA_CLOUD_API_KEY,
    result_type="markdown",
    verbose=True  # Enable detailed parsing logs
)

Test RAG Pipeline

from src.core.knowledge import VectorKnowledgeBase

def test_rag():
    kb = VectorKnowledgeBase()
    
    # Test query
    query = "How do I restart PostgreSQL?"
    response = kb.query(query)
    
    print("Query:", query)
    print("Response:", response)
    
    # Test retrieval
    retriever = kb.index.as_retriever(similarity_top_k=5)
    nodes = retriever.retrieve(query)
    print(f"Retrieved {len(nodes)} nodes")
    for node in nodes:
        print(f"- {node.metadata.get('file_name')}: {node.get_content()[:100]}...")

if __name__ == "__main__":
    test_rag()

Cost Estimation

Estimate API costs for different configurations:
Component   | Model                  | Input Cost           | Output Cost       | Notes
LLM         | gpt-4o                 | $2.50/1M tokens      | $10.00/1M tokens  | Main reasoning
LLM         | gpt-3.5-turbo          | $0.50/1M tokens      | $1.50/1M tokens   | Budget option
Embeddings  | text-embedding-3-small | $0.02/1M tokens      | -                 | Document vectorization
Embeddings  | text-embedding-3-large | $0.13/1M tokens      | -                 | Higher quality
Rerank      | Cohere Rerank          | $1.00/1,000 searches | -                 | Per query
Vector DB   | Pinecone Serverless    | $0.10/1M reads       | $2.00/1M writes   | Storage + queries
Start with the default configuration (gpt-4o + text-embedding-3-small) for the best balance of quality and cost. Optimize based on your usage patterns.
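
As a back-of-envelope example using the rates above (the usage figures are assumptions, not measurements):
queries_per_month = 10_000
input_tokens, output_tokens = 4_000, 500  # per query, assumed

llm_cost = queries_per_month * (input_tokens * 2.50 + output_tokens * 10.00) / 1_000_000
rerank_cost = queries_per_month * 1.00 / 1_000
print(f"LLM: ${llm_cost:,.2f}/mo, rerank: ${rerank_cost:,.2f}/mo")  # $150.00 and $10.00 here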

Next Steps

  • Environment Variables - Configure API keys and system settings
  • Services Configuration - Define services to monitor
  • Knowledge Base - Learn about ingesting documentation
  • Agent Workflow - Understand how the agent uses these models