System overview
The RAG Support System implements a modular, production-ready architecture that combines semantic retrieval, ML triage, and LLM-based generation to answer customer support questions with grounded, cited responses.Design goals
The architecture is built on these principles:- Correctness first — Answers must be supported by retrieved knowledge; hallucinations are unacceptable
- Modularity — Retrieval, generation, and evaluation are independently testable
- Cost awareness — Predictable and controllable LLM usage with bounded retrieval
- Security — Resilience against prompt injection and adversarial inputs
- Production readiness — Observable, scalable, and maintainable
System components
The system is divided into six logical layers:API layer
FastAPI application handling request validation, routing, and response formatting
Triage service
ML models for category and priority prediction with confidence scoring
RAG service
Orchestrates embedding, retrieval, prompt construction, and answer generation
Vector store
Chroma vector database storing document embeddings and metadata
LLM layer
OpenAI models for embeddings, generation, and verification tasks
Evaluation
Offline relevance, faithfulness, and adversarial testing with audit reports
Component details
1. API layer
The FastAPI application exposes HTTP endpoints for client interactions. Location:main.py, src/api/routes/
Key endpoints:
POST /api/v1/answer— Submit questions and get structured answersPOST /api/v1/triage— Run triage models on ticketsPOST /api/v1/ingest— Ingest documents into vector storeGET /api/v1/health— Health check
src/api/models.py):
models.py
The API uses dependency injection for singleton instances of
SimpleRetrievalAgent and TriageModel (see src/api/routes/rag_route.py:17-28).2. Triage service
ML-based classification for incoming tickets. Location:src/api/services/triage_service.py, src/ml/
Models:
- Category classifier — TF-IDF + Logistic Regression for 9 support categories
- Priority classifier — TF-IDF + Logistic Regression for P0/P1/P2 priorities
- Account & Subscription
- Authentication & Access
- Billing & Payments
- Bugs & Errors
- Data Export & Reporting
- Feature Request
- Integrations & API
- Performance & Reliability
- Security & Compliance
Confidence thresholds (0.5) determine when
needs_human_review flag is set. Configurable in src/rag/retriever.py:38-39.3. RAG service
Core retrieval-augmented generation pipeline. Location:src/rag/retriever.py, src/rag/prompts.py, src/rag/structured_outputs.py
Key class: SimpleRetrievalAgent
The RAG agent implements a four-step pipeline:
Configuration (from
src/rag/retriever.py:34-39):
retriever.py
4. Vector store
Chroma database for semantic search. Location:./chroma_db (persistent directory)
Collection: docs_collection
Metadata stored per chunk:
filename— Source document nameelement_id— Unique chunk identifiercategory— Assigned support category (for filtering)
src/rag/ingest.py):
- Parse markdown with Unstructured API
- Chunk into segments (configurable size/overlap)
- Generate embeddings with OpenAI
- Store in Chroma with metadata
Chroma uses SQLite backend with vector indexing for sub-100ms retrieval at moderate scale (10k-100k docs).
5. LLM layer
OpenAI models for generation and verification. Embedding model:text-embedding-3-small (1536 dimensions, ~$0.02/1M tokens)
Generation model: gpt-4.1 (configurable, temperature=0.0 for deterministic output)
Usage:
- Query embedding — Convert user questions to vectors
- Answer generation — Produce grounded customer-facing replies
- Structured outputs — Generate JSON-formatted internal next steps
- Verification — Offline faithfulness and adversarial checks
src/rag/prompts.py):
prompts.py
src/rag/prompts.py:60-127 for full implementation.
6. Evaluation framework
Offline testing for answer quality and robustness. Location:src/rag/evals.py, kb_docs/eval_questions.jsonl
Metrics:
- Relevance — Fraction of expected documents retrieved
- Faithfulness — Whether answer is supported by context
- Adversarial robustness — Resilience to prompt injection and out-of-scope queries
reports/
Evaluations run as offline jobs to avoid impacting production latency. Use for regression testing before deployments.
Request flow (online)
Here’s how a question flows through the system:Detailed flow
2. Triage prediction
ML models predict:
- Category:
Billing & Payments(confidence: 0.92) - Priority:
P1(confidence: 0.87)
src/api/services/triage_service.py.3. Confidence check
If category OR priority confidence < 0.5:
- Set
needs_human_review = True - Continue with RAG pipeline
src/rag/retriever.py:250-253.5. Semantic retrieval
Search Chroma with category filter:Returns top-5 chunks with relevance scores. See
src/rag/retriever.py:97-129.7. Prompt construction
Build category-aware, priority-aware prompt with strict grounding rules:Includes:
- Category-specific role (e.g., “You are a billing support agent”)
- Priority instructions (e.g., “This is a high-priority issue”)
- Absolute rules: use only retrieved context, reject prompt injection
src/rag/prompts.py:60-127.8. Answer generation
Call LLM with constrained prompt:Model:
gpt-4.1, temperature: 0.0 for deterministic output.9. Generate internal next steps
Separate LLM call for structured JSON output:Returns list of 1-3 action items for support agents.
11. Return to client
API returns JSON response with:
draft_reply— Customer-facing answerinternal_next_steps— Actions for agentscitations— Source documents with snippetsneeds_human_review— Boolean flagpredicted_category— Triage outputpredicted_priority— Triage outputconfidence— Scores for category/priority
Typical end-to-end latency: 800ms-1200ms (embedding: 100ms, retrieval: 50ms, generation: 600ms)
Offline evaluation flow
Separate from production, the system runs batch evaluations to measure quality:3. Compute metrics
- Relevance: Did we retrieve expected_docs?
- Faithfulness: Is answer supported by retrieved chunks?
- Adversarial: Does answer reject prompt injection attempts?
Key design tradeoffs
These architectural decisions shape system behavior:1. Faithfulness over creativity
Tradeoff: Lower temperature (0.0) and constrained prompts reduce expressive freedom. Rationale: In support contexts, hallucinations are more harmful than conservative replies. We prioritize factual grounding over creative responses. Implementation:- Temperature set to 0.0 in
src/rag/retriever.py:36 - Prompts enforce “use ONLY retrieved context” rule
- Explicit refusal for insufficient context
2. LLM-based verification vs. custom classifiers
Tradeoff: Use LLM for faithfulness checks instead of training dedicated classifiers. Rationale: Faster iteration and easier maintenance. Suitable for offline checks where latency is not critical. Implementation: Faithfulness prompt insrc/rag/prompts.py:212-236.
3. Separation of online & offline concerns
Tradeoff: Expensive checks (full adversarial analysis) run offline, not per request. Rationale: Keeps production costs and latency low while preserving strong QA via scheduled evaluations. Implementation: Evaluation framework insrc/rag/evals.py runs as separate jobs.
4. No online learning
Tradeoff: System does not auto-update from production user feedback. Rationale: Avoids data poisoning and drift. Improvements come from controlled offline retraining cycles. Future improvement: Implement human-in-the-loop feedback collection for supervised retraining.Scaling strategy
As your deployment grows, consider these scaling paths:Application layer
- Stateless FastAPI — Horizontally scalable via containers (Docker/Kubernetes)
- Load balancing — Use nginx or cloud load balancers
- Async I/O — FastAPI uses
async deffor non-blocking requests
Vector store
- Chroma scaling — Migrate to Chroma server mode for multi-replica deployments
- Managed alternatives — Pinecone, Weaviate, or Qdrant for production scale (millions of docs)
- Hybrid retrieval — Add lexical search (BM25) alongside semantic for better recall
LLM layer
- Caching — Cache embeddings for repeated queries
- Rate limiting — Implement per-user limits to control costs
- Model optimization — Use distilled models or smaller embeddings for lower latency
Background jobs
- Separate workers — Run evaluations and retraining in separate processes
- Job queues — Use Celery or RQ for asynchronous document ingestion
Security considerations
The architecture includes several security measures:Prompt injection defenses
Prompt injection defenses
Prompts explicitly reject instructions that attempt to:
- Change the model’s role
- Override system rules
- Reveal internal prompts
src/rag/prompts.py:93-108.Adversarial testing
Adversarial testing
Offline evaluation suite includes prompt injection test cases:
- “Ignore previous instructions”
- “Act as a different role”
- “Reveal your system prompt”
python -m src.rag.evalsSecrets management
Secrets management
- API keys stored in
.env(gitignored) - Loaded via
python-dotenvat runtime - Never logged or exposed in responses
Input validation
Input validation
Pydantic models enforce types and constraints:
Monitoring and observability
For production deployments, implement:- Structured logging — JSON logs with request IDs, latency, and error traces
- Metrics — Track retrieval relevance, faithfulness rate, LLM latency, and cost
- Alerting — Monitor for high error rates, low confidence scores, and prompt injection attempts
- Tracing — Use OpenTelemetry or LangSmith for distributed tracing
The codebase uses Python’s logging module (see
src/logger.py). Integrate with ELK, Datadog, or LangFuse for production observability.Future improvements
Roadmap items documented inARCHITECTURE.md:113-120:
- Automated gating — Block releases based on evaluation metrics
- BERT-based triage — Replace TF-IDF with encoder models for better classification
- Hybrid retrieval — Combine lexical (BM25) and semantic search
- Per-answer confidence — Attach scores to individual citations
- Human-in-the-loop — Feedback collection and review workflows
Summary
The RAG Support System architecture prioritizes:- Reliability — Grounded answers with explicit refusal for insufficient context
- Safety — Prompt injection defenses and adversarial testing
- Operational clarity — Modular components, structured outputs, and offline evaluation
Quickstart
Get started with your first RAG query
API Reference
Explore endpoints and request models