Quickstart
This guide will help you set up the RAG Support System and make your first query. You’ll install dependencies, ingest knowledge base documents, start the API server, and submit a test question.

Prerequisites

Before you begin, ensure you have:
  • Python 3.12+ installed
  • uv package manager (recommended) or pip
  • OpenAI API key for embeddings and LLM
  • Unstructured API key for document parsing
If you don’t have uv installed, get it with: curl -LsSf https://astral.sh/uv/install.sh | sh

Step 1: Clone and install

1. Clone the repository

git clone https://github.com/JoAmps/rgt-assignment.git
cd rgt-assignment

2. Create a virtual environment

python -m venv .venv
source .venv/bin/activate  # macOS/Linux
# .venv\Scripts\activate  # Windows

3. Install dependencies

uv sync

This installs all required packages from pyproject.toml, including FastAPI, LangChain, Chroma, and OpenAI.

Step 2: Configure environment variables

Create a .env file in the project root with your API keys:
.env
OPENAI_API_KEY=your_openai_api_key
UNSTRUCTURED_API_KEY=your_unstructured_api_key
Keep your .env file out of version control. Never commit API keys to your repository.
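At runtime these variables need to end up in the process environment. A stdlib-only sketch of a .env loader (the project may rely on python-dotenv instead; `load_env` is a hypothetical helper, not the project's code):

```python
import os

def load_env(path: str = ".env") -> None:
    """Minimal .env loader: copies KEY=VALUE pairs into os.environ.

    Illustrative sketch only; python-dotenv is the usual choice.
    Existing environment variables are never overwritten.
    """
    try:
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("#") or "=" not in line:
                    continue  # skip blanks, comments, and malformed lines
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())
    except FileNotFoundError:
        pass  # no .env file; rely on the existing environment
```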

Step 3: Ingest knowledge base documents

Before the RAG system can answer questions, you need to ingest documentation into the vector store.
# Ingest all .md files from kb_docs/ folder
uv run -m src.rag.ingest
Document ingestion chunks your markdown files with the Unstructured API, generates embeddings with OpenAI’s text-embedding-3-small, and stores them in Chroma at ./chroma_db.
You should see output like:
Processing: kb_docs/billing.md
Chunked into 12 segments
Stored in Chroma collection: docs_collection
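The real pipeline delegates chunking to the Unstructured API, but the idea can be illustrated with a naive header-based markdown splitter (hypothetical code, not the project's implementation):

```python
def chunk_markdown(text: str, max_chars: int = 500) -> list[str]:
    """Naive splitter: break on H2 headings, then cap each chunk's size.

    Illustrative only; the project uses the Unstructured API, which
    understands document structure far better than this sketch.
    """
    chunks: list[str] = []
    for section in text.split("\n## "):
        section = section.strip()
        # Split oversized sections at the last newline before the cap
        while len(section) > max_chars:
            cut = section.rfind("\n", 0, max_chars)
            if cut <= 0:
                cut = max_chars  # no newline found: hard cut
            chunks.append(section[:cut].strip())
            section = section[cut:].strip()
        if section:
            chunks.append(section)
    return chunks
```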

Step 4: Start the API server

Launch the FastAPI development server:
uv run main.py
The server starts at http://localhost:8000 with these endpoints:
  • GET /api/v1/health — Health check
  • POST /api/v1/ingest — Ingest documents
  • POST /api/v1/answer — Submit questions
  • POST /api/v1/triage — Run triage models
Visit http://localhost:8000/api/docs for interactive API documentation.

Step 5: Make your first RAG query

Now that documents are ingested and the API is running, submit a test question:
curl -X POST "http://localhost:8000/api/v1/answer" \
  -H "Content-Type: application/json" \
  -d '{
    "subject": "Refund issue",
    "body": "I was charged twice for my subscription",
    "user_question": "How long does a refund take?"
  }'
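The same request can be made from Python with only the standard library (a sketch; `ask` and `build_payload` are illustrative helpers, not part of the project):

```python
import json
import urllib.request

def build_payload(subject: str, body: str, user_question: str) -> bytes:
    """Encode the /answer request body as UTF-8 JSON."""
    return json.dumps({
        "subject": subject,
        "body": body,
        "user_question": user_question,
    }).encode("utf-8")

def ask(subject: str, body: str, user_question: str,
        url: str = "http://localhost:8000/api/v1/answer") -> dict:
    """POST a support ticket to the running API and return the parsed JSON."""
    req = urllib.request.Request(
        url,
        data=build_payload(subject, body, user_question),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```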

Expected response

{
  "draft_reply": "Refunds are typically processed within 5-7 business days. Once approved, the amount will be credited back to your original payment method. You'll receive an email confirmation when the refund is complete.",
  "internal_next_steps": [
    "Verify duplicate charge in billing system",
    "Initiate refund for duplicate transaction",
    "Follow up with customer in 7 days"
  ],
  "citations": [
    {
      "document_name": "billing.md",
      "chunk_id": "element-12",
      "snippet": "Refunds are typically processed...",
      "full_content": "Refunds are typically processed within 5-7 business days..."
    }
  ],
  "needs_human_review": false,
  "predicted_category": "Billing & Payments",
  "predicted_priority": "P1",
  "confidence": {
    "category": 0.92,
    "priority": 0.87
  }
}
The response includes:
  • draft_reply — Customer-facing answer
  • internal_next_steps — Actions for support agents
  • citations — Source documents with snippets
  • needs_human_review — Flag for low-confidence predictions
  • predicted_category and predicted_priority — Triage model outputs
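If you consume this endpoint from typed Python code, the response shape can be modeled roughly as follows (a sketch inferred from the example above, not the project's own response models):

```python
from typing import TypedDict

class Citation(TypedDict):
    document_name: str
    chunk_id: str
    snippet: str
    full_content: str

class Confidence(TypedDict):
    category: float
    priority: float

class AnswerResponse(TypedDict):
    draft_reply: str
    internal_next_steps: list[str]
    citations: list[Citation]
    needs_human_review: bool
    predicted_category: str
    predicted_priority: str
    confidence: Confidence
```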

Understanding the response

Let’s break down what just happened:
1. Triage prediction

The system ran ML models to predict:
  • Category: Billing & Payments (confidence: 0.92)
  • Priority: P1 (confidence: 0.87)
High confidence scores mean the prediction is reliable.

2. Semantic retrieval

The RAG agent:
  1. Embedded your question with text-embedding-3-small
  2. Searched Chroma for the top 5 most similar chunks, filtered by the predicted category
  3. Retrieved relevant context from billing.md

3. Answer generation

The LLM (GPT-4.1) generated a grounded answer using:
  • Retrieved context from the knowledge base
  • A category-specific system prompt
  • Priority-aware tone adjustments
The answer is constrained to use only information from the retrieved documents.

4. Structured outputs

The system generated:
  • Citations with source documents and snippets
  • Internal next steps for support agents
  • A review flag based on confidence thresholds (0.5 for both category and priority)
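The review flag reduces to a simple threshold check; a minimal sketch, assuming the 0.5 thresholds mentioned above (names are illustrative):

```python
CATEGORY_CONF_THRESHOLD = 0.5
PRIORITY_CONF_THRESHOLD = 0.5

def needs_review(confidence: dict[str, float]) -> bool:
    """Flag for human review when either triage score falls below threshold.

    Missing scores default to 0.0, so an absent prediction is always flagged.
    """
    return (
        confidence.get("category", 0.0) < CATEGORY_CONF_THRESHOLD
        or confidence.get("priority", 0.0) < PRIORITY_CONF_THRESHOLD
    )
```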

What’s happening under the hood?

Here’s the code that powers your query, from src/rag/retriever.py:201-260:
retriever.py
def answer(
    self,
    query: str,
    predicted_category: str,
    priority: str,
    confidence: Dict[str, float],
    k: int = 5,
) -> Dict:
    """
    End-to-end RAG pipeline.
    """
    # Retrieve top-k chunks filtered by category
    chunks = self.retrieve(
        query,
        predicted_category=predicted_category,
        k=k,
    )

    if not chunks:
        return {
            "draft_reply": "Insufficient context. Please clarify your request.",
            "internal_next_steps": [],
            "citations": [],
            "needs_human_review": True,
        }

    # Assemble context from retrieved chunks
    context_text = "\n\n".join(c["content"] for c in chunks)

    # Generate grounded answer
    answer = self.generate_answer(
        context=context_text,
        query=query,
        predicted_category=predicted_category,
        priority=priority,
    )

    # Generate internal next steps
    internal_next_steps = generate_internal_next_steps(
        context=context_text,
        query=query,
    )

    # Flag for human review if confidence is low
    needs_human_review = (
        confidence.get("category", 0) < CATEGORY_CONF_THRESHOLD
        or confidence.get("priority", 0) < PRIORITY_CONF_THRESHOLD
    )

    return self.format_response(
        answer=answer,
        internal_next_steps=internal_next_steps,
        chunks=chunks,
        needs_human_review=needs_human_review,
    )

Next steps

Now that you’ve made your first query, explore more:

Train triage models

Train custom ML models on your support tickets:
uv run -m src.ml.train

Run evaluations

Test answer quality with offline metrics:
python -m src.rag.evals

Explore the architecture

Learn how system components work together

Read API docs

Dive into endpoint specifications and request models

Troubleshooting

  • Import or dependency errors: ensure you’ve activated your virtual environment and run uv sync:
    source .venv/bin/activate
    uv sync
  • Authentication errors: verify that your .env file contains a valid OPENAI_API_KEY and is in the project root directory.
  • Empty or missing answers: make sure you’ve ingested documents before querying:
    uv run -m src.rag.ingest
    Check that the ./chroma_db directory exists and contains data.
  • Triage model errors: train the triage models before making queries:
    uv run -m src.ml.train
    Trained models are saved to artifacts/ and are required by the /answer endpoint.
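A quick preflight check covering the failure modes above (a hedged sketch; the paths follow the defaults used in this guide):

```python
import os

def preflight(root: str = ".") -> list[str]:
    """Return a list of setup problems; empty means everything looks ready."""
    problems: list[str] = []
    if not os.environ.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set (check your .env)")
    chroma = os.path.join(root, "chroma_db")
    if not os.path.isdir(chroma) or not os.listdir(chroma):
        problems.append("./chroma_db is missing or empty; run: uv run -m src.rag.ingest")
    artifacts = os.path.join(root, "artifacts")
    if not os.path.isdir(artifacts) or not os.listdir(artifacts):
        problems.append("artifacts/ is missing or empty; run: uv run -m src.ml.train")
    return problems
```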
