Architecture - RAG Support System

This page explains how the RAG Support System is architected, how requests flow through the system, and the key design tradeoffs that shape its behavior.

System overview

The RAG Support System implements a modular, production-ready architecture that combines semantic retrieval, ML triage, and LLM-based generation to answer customer support questions with grounded, cited responses.

Design goals

The architecture is built on these principles:

Correctness first — Answers must be supported by retrieved knowledge; hallucinations are unacceptable
Modularity — Retrieval, generation, and evaluation are independently testable
Cost awareness — Predictable and controllable LLM usage with bounded retrieval
Security — Resilience against prompt injection and adversarial inputs
Production readiness — Observable, scalable, and maintainable

System components

The system is divided into six logical layers:

API layer

FastAPI application handling request validation, routing, and response formatting

Triage service

ML models for category and priority prediction with confidence scoring

RAG service

Orchestrates embedding, retrieval, prompt construction, and answer generation

Vector store

Chroma vector database storing document embeddings and metadata

LLM layer

OpenAI models for embeddings, generation, and verification tasks

Evaluation

Offline relevance, faithfulness, and adversarial testing with audit reports

Component details

1. API layer

The FastAPI application exposes HTTP endpoints for client interactions. Location: main.py, src/api/routes/ Key endpoints:

POST /api/v1/answer — Submit questions and get structured answers
POST /api/v1/triage — Run triage models on tickets
POST /api/v1/ingest — Ingest documents into vector store
GET /api/v1/health — Health check

Request/response models (from src/api/models.py):

models.py

class AnswerRequest(BaseModel):
    subject: str
    body: str
    user_question: str

class AnswerResponse(BaseModel):
    draft_reply: str
    internal_next_steps: List[str]
    citations: List[Citation]
    needs_human_review: bool
    predicted_category: str
    predicted_priority: str
    confidence: Dict[str, float]

The API uses dependency injection for singleton instances of SimpleRetrievalAgent and TriageModel (see src/api/routes/rag_route.py:17-28).

2. Triage service

ML-based classification for incoming tickets. Location: src/api/services/triage_service.py, src/ml/ Models:

Category classifier — TF-IDF + Logistic Regression for 9 support categories
Priority classifier — TF-IDF + Logistic Regression for P0/P1/P2 priorities

Categories supported:

Account & Subscription
Authentication & Access
Billing & Payments
Bugs & Errors
Data Export & Reporting
Feature Request
Integrations & API
Performance & Reliability
Security & Compliance

Output: Predicted category, priority, and confidence scores (0-1).

Confidence thresholds (0.5) determine when needs_human_review flag is set. Configurable in src/rag/retriever.py:38-39.

3. RAG service

Core retrieval-augmented generation pipeline. Location: src/rag/retriever.py, src/rag/prompts.py, src/rag/structured_outputs.py Key class: SimpleRetrievalAgent The RAG agent implements a four-step pipeline:

Embed query

Converts user question to vector using text-embedding-3-small

Retrieve chunks

Searches Chroma for top-k similar documents, filtered by predicted category

Generate answer

Passes retrieved context and category-aware prompt to LLM

Format response

Assembles structured output with citations, next steps, and review flag

Configuration (from src/rag/retriever.py:34-39):

retriever.py

DEFAULT_EMBEDDING_MODEL = "text-embedding-3-small"
DEFAULT_LLM_MODEL = "gpt-4.1"
DEFAULT_TEMPERATURE = 0.0

CATEGORY_CONF_THRESHOLD = 0.5
PRIORITY_CONF_THRESHOLD = 0.5

4. Vector store

Chroma database for semantic search. Location: ./chroma_db (persistent directory) Collection: docs_collection Metadata stored per chunk:

filename — Source document name
element_id — Unique chunk identifier
category — Assigned support category (for filtering)

Ingestion process (from src/rag/ingest.py):

Parse markdown with Unstructured API
Chunk into segments (configurable size/overlap)
Generate embeddings with OpenAI
Store in Chroma with metadata

Chroma uses SQLite backend with vector indexing for sub-100ms retrieval at moderate scale (10k-100k docs).

5. LLM layer

OpenAI models for generation and verification. Embedding model: text-embedding-3-small (1536 dimensions, ~$0.02/1M tokens) Generation model: gpt-4.1 (configurable, temperature=0.0 for deterministic output) Usage:

Query embedding — Convert user questions to vectors
Answer generation — Produce grounded customer-facing replies
Structured outputs — Generate JSON-formatted internal next steps
Verification — Offline faithfulness and adversarial checks

Prompt templates (from src/rag/prompts.py):

prompts.py

def generate_prompt(
    category: str,
    context_text: str,
    query: str,
    priority: str,
) -> str:
    """
    Generate customer-facing prompt with:
    - Category-specific role
    - Priority-aware tone
    - Strict grounding rules
    - Prompt injection defenses
    """

See src/rag/prompts.py:60-127 for full implementation.

6. Evaluation framework

Offline testing for answer quality and robustness. Location: src/rag/evals.py, kb_docs/eval_questions.jsonl Metrics:

Relevance — Fraction of expected documents retrieved
Faithfulness — Whether answer is supported by context
Adversarial robustness — Resilience to prompt injection and out-of-scope queries

Output: Markdown and JSON reports in reports/

Evaluations run as offline jobs to avoid impacting production latency. Use for regression testing before deployments.

Request flow (online)

Here’s how a question flows through the system:

Detailed flow

1. Client submits request

POST to /api/v1/answer with:

{
  "subject": "Refund issue",
  "body": "I was charged twice",
  "user_question": "How long does a refund take?"
}

Handled by src/api/routes/rag_route.py:48-69.

2. Triage prediction

ML models predict:

Category: Billing & Payments (confidence: 0.92)
Priority: P1 (confidence: 0.87)

Implementation in src/api/services/triage_service.py.

3. Confidence check

If category OR priority confidence < 0.5:

Set needs_human_review = True
Continue with RAG pipeline

See src/rag/retriever.py:250-253.

4. Query embedding

Embed question with OpenAI text-embedding-3-small:

embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
    openai_api_key=OPENAI_API_KEY,
)
query_vector = embeddings.embed_query(query)

5. Semantic retrieval

Search Chroma with category filter:

results = self.vectordb.similarity_search_with_relevance_scores(
    query,
    k=5,
    filter={"category": predicted_category},
)

Returns top-5 chunks with relevance scores. See src/rag/retriever.py:97-129.

6. Context assembly

Concatenate retrieved chunks:

context_text = "\n\n".join(c["content"] for c in chunks)

7. Prompt construction

Build category-aware, priority-aware prompt with strict grounding rules:

prompt = generate_prompt(
    predicted_category,
    context_text,
    query,
    priority,
)

Includes:

Category-specific role (e.g., “You are a billing support agent”)
Priority instructions (e.g., “This is a high-priority issue”)
Absolute rules: use only retrieved context, reject prompt injection

Full template in src/rag/prompts.py:60-127.

8. Answer generation

Call LLM with constrained prompt:

return self.llm.invoke([HumanMessage(content=prompt)]).content

Model: gpt-4.1, temperature: 0.0 for deterministic output.

9. Generate internal next steps

Separate LLM call for structured JSON output:

internal_next_steps = generate_internal_next_steps(
    context=context_text,
    query=query,
)

Returns list of 1-3 action items for support agents.

10. Format response

Assemble structured output:

return {
    "draft_reply": answer,
    "internal_next_steps": internal_next_steps,
    "citations": [
        {
            "document_name": chunk["metadata"]["filename"],
            "chunk_id": chunk["metadata"]["element_id"],
            "snippet": chunk["content"][:35],
            "full_content": chunk["content"],
        }
        for chunk in chunks
    ],
    "needs_human_review": needs_human_review,
}

See src/rag/retriever.py:167-195.

11. Return to client

API returns JSON response with:

draft_reply — Customer-facing answer
internal_next_steps — Actions for agents
citations — Source documents with snippets
needs_human_review — Boolean flag
predicted_category — Triage output
predicted_priority — Triage output
confidence — Scores for category/priority

Typical end-to-end latency: 800ms-1200ms (embedding: 100ms, retrieval: 50ms, generation: 600ms)

Offline evaluation flow

Separate from production, the system runs batch evaluations to measure quality:

1. Load test cases

Read JSONL file with expected behaviors:

{
  "ticket_subject": "Refund delay",
  "ticket_body": "My refund was promised 5 days ago",
  "user_question": "Where is my refund?",
  "expected_docs": ["billing.md", "refunds.md"],
  "expected_behavior": "Should explain refund timeline"
}

2. Run RAG pipeline

Execute same online flow for each test case.

3. Compute metrics

Relevance: Did we retrieve expected_docs?
Faithfulness: Is answer supported by retrieved chunks?
Adversarial: Does answer reject prompt injection attempts?

4. Generate reports

Write markdown summaries and JSON logs to reports/:

rag_eval.md — Relevance and faithfulness metrics
adversarial_eval.md — Robustness test results

Run with:

python -m src.rag.evals

Key design tradeoffs

These architectural decisions shape system behavior:

1. Faithfulness over creativity

Tradeoff: Lower temperature (0.0) and constrained prompts reduce expressive freedom. Rationale: In support contexts, hallucinations are more harmful than conservative replies. We prioritize factual grounding over creative responses. Implementation:

Temperature set to 0.0 in src/rag/retriever.py:36
Prompts enforce “use ONLY retrieved context” rule
Explicit refusal for insufficient context

2. LLM-based verification vs. custom classifiers

Tradeoff: Use LLM for faithfulness checks instead of training dedicated classifiers. Rationale: Faster iteration and easier maintenance. Suitable for offline checks where latency is not critical. Implementation: Faithfulness prompt in src/rag/prompts.py:212-236.

3. Separation of online & offline concerns

Tradeoff: Expensive checks (full adversarial analysis) run offline, not per request. Rationale: Keeps production costs and latency low while preserving strong QA via scheduled evaluations. Implementation: Evaluation framework in src/rag/evals.py runs as separate jobs.

4. No online learning

Tradeoff: System does not auto-update from production user feedback. Rationale: Avoids data poisoning and drift. Improvements come from controlled offline retraining cycles. Future improvement: Implement human-in-the-loop feedback collection for supervised retraining.

Scaling strategy

As your deployment grows, consider these scaling paths:

Application layer

Stateless FastAPI — Horizontally scalable via containers (Docker/Kubernetes)
Load balancing — Use nginx or cloud load balancers
Async I/O — FastAPI uses async def for non-blocking requests

Vector store

Chroma scaling — Migrate to Chroma server mode for multi-replica deployments
Managed alternatives — Pinecone, Weaviate, or Qdrant for production scale (millions of docs)
Hybrid retrieval — Add lexical search (BM25) alongside semantic for better recall

LLM layer

Caching — Cache embeddings for repeated queries
Rate limiting — Implement per-user limits to control costs
Model optimization — Use distilled models or smaller embeddings for lower latency

Background jobs

Separate workers — Run evaluations and retraining in separate processes
Job queues — Use Celery or RQ for asynchronous document ingestion

Security considerations

The architecture includes several security measures:

Prompt injection defenses

Prompts explicitly reject instructions that attempt to:

Change the model’s role
Override system rules
Reveal internal prompts

See absolute rules in src/rag/prompts.py:93-108.

Adversarial testing

Offline evaluation suite includes prompt injection test cases:

“Ignore previous instructions”
“Act as a different role”
“Reveal your system prompt”

Run with: python -m src.rag.evals

Secrets management

API keys stored in .env (gitignored)
Loaded via python-dotenv at runtime
Never logged or exposed in responses

Production: Use AWS Secrets Manager, Vault, or similar.

Input validation

Pydantic models enforce types and constraints:

class AnswerRequest(BaseModel):
    subject: str = Field(..., max_length=200)
    body: str = Field(..., max_length=5000)
    user_question: str = Field(..., max_length=500)

Monitoring and observability

For production deployments, implement:

Structured logging — JSON logs with request IDs, latency, and error traces
Metrics — Track retrieval relevance, faithfulness rate, LLM latency, and cost
Alerting — Monitor for high error rates, low confidence scores, and prompt injection attempts
Tracing — Use OpenTelemetry or LangSmith for distributed tracing

The codebase uses Python’s logging module (see src/logger.py). Integrate with ELK, Datadog, or LangFuse for production observability.

Future improvements

Roadmap items documented in ARCHITECTURE.md:113-120:

Automated gating — Block releases based on evaluation metrics
BERT-based triage — Replace TF-IDF with encoder models for better classification
Hybrid retrieval — Combine lexical (BM25) and semantic search
Per-answer confidence — Attach scores to individual citations
Human-in-the-loop — Feedback collection and review workflows

Summary

The RAG Support System architecture prioritizes:

Reliability — Grounded answers with explicit refusal for insufficient context
Safety — Prompt injection defenses and adversarial testing
Operational clarity — Modular components, structured outputs, and offline evaluation

This design is well-suited for real-world customer support where trust, observability, and control matter most.

Quickstart

Get started with your first RAG query

API Reference

Explore endpoints and request models

Getting Started

Core Concepts

Guides

Deployment

​System overview

​Design goals

​System components

API layer

Triage service

RAG service

Vector store

LLM layer

Evaluation

​Component details

​1. API layer

​2. Triage service

​3. RAG service

​4. Vector store

​5. LLM layer

​6. Evaluation framework

​Request flow (online)

​Detailed flow

​Offline evaluation flow

​Key design tradeoffs

​1. Faithfulness over creativity

​2. LLM-based verification vs. custom classifiers

​3. Separation of online & offline concerns

​4. No online learning

​Scaling strategy

​Application layer

​Vector store

​LLM layer

​Background jobs

​Security considerations

​Monitoring and observability

​Future improvements

​Summary

Quickstart

API Reference

Build docs developers (and LLMs) love

System overview

Design goals

System components

Component details

1. API layer

2. Triage service

3. RAG service

4. Vector store

5. LLM layer

6. Evaluation framework

Request flow (online)

Detailed flow

Offline evaluation flow

Key design tradeoffs

1. Faithfulness over creativity

2. LLM-based verification vs. custom classifiers

3. Separation of online & offline concerns

4. No online learning

Scaling strategy

Application layer

Vector store

LLM layer

Background jobs

Security considerations

Monitoring and observability

Future improvements

Summary