Overview

The QA API is a FastAPI service that provides intelligent question-answering for a Tableau course using Retrieval-Augmented Generation (RAG). The service embeds course transcripts, retrieves relevant context, and generates answers with citation tracking and hallucination detection.

Architecture

  • Framework: FastAPI 1.1.0
  • LLM Provider: OpenAI (configurable model)
  • Vector Store: ChromaDB with OpenAI embeddings
  • Orchestration: LangChain for RAG pipeline
  • Document Processing: Markdown header splitting + token chunking

RAG Pipeline

  1. Document Loading: PDF transcripts loaded via PyPDFLoader
  2. Splitting: Markdown header splitter (section/lecture) + token splitter (350 tokens, 50 overlap)
  3. Embedding: OpenAI embeddings (text-embedding-3-small default)
  4. Storage: ChromaDB collection (tableau_qa_collection)
  5. Retrieval: Top-k=4 most relevant chunks
  6. Generation: ChatOpenAI with zero temperature for deterministic answers
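Step 3's overlapping token windows can be sketched in pure Python; this is a minimal illustration of the 350-token window with 50-token overlap, using a plain token list in place of the service's actual tokenizer:

```python
def chunk_tokens(tokens, chunk_size=350, overlap=50):
    """Split a token list into overlapping chunks (step 3 of the pipeline)."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
    return chunks

tokens = [f"tok{i}" for i in range(1000)]
chunks = chunk_tokens(tokens)
# Adjacent chunks share exactly 50 tokens, so no sentence is cut off
# at a hard boundary without appearing in the neighboring chunk.
```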

Request/Response Schemas

QARequest

class QARequest(BaseModel):
    question_lecture: str = Field(..., min_length=1)
    question_title: str = Field(..., min_length=1)
    question_body: str = Field(..., min_length=1)

QAResponse

class QAResponse(BaseModel):
    answer: str
    confidence: float
    citations: List[str]
    latency_ms: float
    retrieval_accuracy: float
    hallucination_flag: bool

API Endpoints

POST /qa

Synchronous question-answering endpoint. Request Example:
curl -X POST http://localhost:8001/qa \
  -H "Content-Type: application/json" \
  -d '{
    "question_lecture": "Calculations",
    "question_title": "Understanding SUM in GM%",
    "question_body": "Why do we need to wrap numerator and denominator in SUM() for gross margin percentage calculations?"
  }'
Response Example:
{
  "answer": "In Tableau, when calculating gross margin percentage (GM%), we use SUM() around both the numerator and denominator to ensure proper aggregation at the visualization level. Without SUM(), Tableau would calculate row-level percentages before aggregating, leading to incorrect results.\n\nCitations:\n- [Section: Calculations, Lecture: Adding a custom calculation]",
  "confidence": 0.85,
  "citations": [
    "[Section: Calculations, Lecture: Adding a custom calculation]"
  ],
  "latency_ms": 1234.56,
  "retrieval_accuracy": 1.0,
  "hallucination_flag": false
}
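The same request can be issued from Python using only the standard library. This is a sketch, not part of the service; the `base_url` and timeout are assumptions:

```python
import json
import urllib.request

def build_payload(lecture, title, body):
    """Build the QARequest body expected by POST /qa."""
    return {
        "question_lecture": lecture,
        "question_title": title,
        "question_body": body,
    }

def ask(base_url, payload, timeout=30):
    """POST the payload to /qa and return the parsed QAResponse dict."""
    req = urllib.request.Request(
        base_url + "/qa",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)

payload = build_payload(
    "Calculations",
    "Understanding SUM in GM%",
    "Why do we need to wrap numerator and denominator in SUM()?",
)
# ask("http://localhost:8001", payload) would return the QAResponse shown above.
```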

POST /qa/stream

Streaming question-answering endpoint using Server-Sent Events (SSE). Request Example:
curl -X POST http://localhost:8001/qa/stream \
  -H "Content-Type: application/json" \
  -d '{
    "question_lecture": "Visual Analytics",
    "question_title": "Chart Selection",
    "question_body": "When should I use bar charts vs line charts?"
  }'
Response Format (Server-Sent Events):
data: {"token": "Bar"}

data: {"token": " charts"}

data: {"token": " are"}

data: {"token": " best"}

data: {"done": true, "confidence": 0.8, "citations": ["[Section: Visual Analytics, Lecture: Building charts]"], "latency_ms": 2345.67, "retrieval_accuracy": 1.0, "hallucination_flag": false}
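A client can reassemble the streamed answer by parsing each `data:` line as JSON; a minimal parser sketch for the event format shown above:

```python
import json

def parse_sse_events(raw):
    """Parse 'data: {...}' lines from an SSE stream into a list of dicts."""
    events = []
    for line in raw.splitlines():
        line = line.strip()
        if line.startswith("data: "):
            events.append(json.loads(line[len("data: "):]))
    return events

stream = (
    'data: {"token": "Bar"}\n\n'
    'data: {"token": " charts"}\n\n'
    'data: {"done": true, "confidence": 0.8}\n'
)
events = parse_sse_events(stream)
# Concatenate the token events; the final "done" event carries the metrics.
answer = "".join(e["token"] for e in events if "token" in e)
```

In a real client the stream would be read incrementally from the HTTP response rather than from a complete string.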

GET /health

Readiness check; reports whether the service is ready to answer questions. Response Schema:
class HealthResponse(BaseModel):
    ready: bool
curl http://localhost:8001/health

GET /monitoring

Returns aggregated QA performance metrics. Response Schema:
class MonitoringResponse(BaseModel):
    requests_total: int
    avg_latency_ms: float
    avg_retrieval_accuracy: float
    hallucination_rate: float
Example:
curl http://localhost:8001/monitoring
{
  "requests_total": 127,
  "avg_latency_ms": 1523.45,
  "avg_retrieval_accuracy": 0.9449,
  "hallucination_rate": 0.0315
}

Configuration

Environment variables:
  • OPENAI_API_KEY: Required. OpenAI API key for embeddings and chat
  • OPENAI_CHAT_MODEL: Chat model name (default: gpt-4o-mini)
  • OPENAI_EMBEDDING_MODEL: Embedding model (default: text-embedding-3-small)
  • QA_TRANSCRIPT_PDF: Path to course transcript PDF (default: tableau_course_transcript.pdf)
  • QA_CHROMA_COLLECTION: ChromaDB collection name (default: tableau_qa_collection)
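The defaults above can be resolved with `os.getenv`; the variable names on the left are illustrative, not necessarily the names used in `src/qa_api.py`:

```python
import os

# Resolve configuration from the environment, falling back to the
# documented defaults. OPENAI_API_KEY has no default: it is required.
CHAT_MODEL = os.getenv("OPENAI_CHAT_MODEL", "gpt-4o-mini")
EMBEDDING_MODEL = os.getenv("OPENAI_EMBEDDING_MODEL", "text-embedding-3-small")
TRANSCRIPT_PDF = os.getenv("QA_TRANSCRIPT_PDF", "tableau_course_transcript.pdf")
CHROMA_COLLECTION = os.getenv("QA_CHROMA_COLLECTION", "tableau_qa_collection")
API_KEY = os.getenv("OPENAI_API_KEY")  # None when unset -> fallback behavior
```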

Document Processing

Transcript processing pipeline (src/qa_api.py:163-184):
  1. Load PDF using PyPDFLoader
  2. Split by Markdown headers:
    • # → section
    • ## → lecture
    • ### → topic
  3. Token-based chunking:
    • Chunk size: 350 tokens
    • Overlap: 50 tokens
  4. Embed chunks with OpenAI embeddings
  5. Store in ChromaDB
If the PDF is missing, the service falls back to a bundled sample transcript.
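Step 2's header splitting attaches section/lecture/topic metadata to each block of text; a simplified sketch of the idea (the actual service uses LangChain's Markdown header splitter):

```python
def split_by_headers(md_text):
    """Attach section/lecture/topic metadata to each text block (step 2)."""
    levels = {"#": "section", "##": "lecture", "###": "topic"}
    meta, blocks, current = {}, [], []
    for line in md_text.splitlines():
        stripped = line.strip()
        marker = stripped.split(" ")[0] if stripped.startswith("#") else None
        if marker in levels:
            if current:
                blocks.append({"metadata": dict(meta), "text": "\n".join(current)})
                current = []
            meta[levels[marker]] = stripped[len(marker):].strip()
            # A new section invalidates the old lecture/topic, and so on down.
            if marker == "#":
                meta.pop("lecture", None)
                meta.pop("topic", None)
            elif marker == "##":
                meta.pop("topic", None)
        elif stripped:
            current.append(stripped)
    if current:
        blocks.append({"metadata": dict(meta), "text": "\n".join(current)})
    return blocks

doc = """# Calculations
## Adding a custom calculation
Wrap numerator and denominator in SUM().
"""
blocks = split_by_headers(doc)
```

The metadata carried on each block is what later makes the `[Section: ..., Lecture: ...]` citations possible.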

Prompt Engineering

System prompt (src/qa_api.py:78-88):
PROMPT_RETRIEVING_S = """You are a helpful teaching assistant for a Tableau course.
You will receive a student question and supporting context passages.

Rules:
1) Answer ONLY using the supplied context.
2) If context is insufficient, say exactly: "I don't have enough context to answer confidently."
3) Add a short "Citations" section at the end.
4) Each citation must use this format:
   - [Section: <section>, Lecture: <lecture>]
5) Do not invent citations.
"""
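At query time the retrieved passages must be paired with this system prompt as a user message. A sketch of one way to format them; the passage field names and the exact layout are assumptions, not the service's verbatim code:

```python
def build_user_prompt(question, passages):
    """Join retrieved passages and the student question into one message."""
    context = "\n\n".join(
        f"[Section: {p['section']}, Lecture: {p['lecture']}]\n{p['text']}"
        for p in passages
    )
    return f"Context:\n{context}\n\nQuestion:\n{question}"

passages = [{
    "section": "Calculations",
    "lecture": "Adding a custom calculation",
    "text": "Wrap numerator and denominator in SUM().",
}]
msg = build_user_prompt("Why wrap GM% terms in SUM()?", passages)
```

Embedding the citation label directly in each passage gives the model the exact string that rule 4 requires it to echo back.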

Citation Validation

The service extracts citations using regex pattern matching (src/qa_api.py:104-106):
pattern = r"\[Section:\s*.*?,\s*Lecture:\s*.*?\]"
Citations are validated against retrieved documents to compute retrieval_accuracy:
  • 1.0: All citations match retrieved documents
  • < 1.0: Some citations are hallucinated (triggers hallucination_flag)
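Using the regex from the source, the validation logic can be sketched as follows; treating zero extracted citations as accuracy 0.0 matches the fallback response shown later, but the exact edge-case handling in `src/qa_api.py` is an assumption:

```python
import re

CITATION_RE = re.compile(r"\[Section:\s*.*?,\s*Lecture:\s*.*?\]")

def retrieval_accuracy(answer_text, retrieved_citations):
    """Share of cited sources that appear among the retrieved documents."""
    cited = CITATION_RE.findall(answer_text)
    if not cited:
        return 0.0
    valid = sum(1 for c in cited if c in retrieved_citations)
    return valid / len(cited)

answer = ("... as covered in the course.\n\nCitations:\n"
          "- [Section: Calculations, Lecture: Adding a custom calculation]")
retrieved = {"[Section: Calculations, Lecture: Adding a custom calculation]"}
acc = retrieval_accuracy(answer, retrieved)
hallucination_flag = acc < 1.0  # flag trips when any citation is invented
```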

Confidence Scoring

Confidence is computed using a weighted formula (src/qa_api.py:126-131):
coverage = min(len(retrieved_docs) / 4.0, 1.0)
nonempty = 1.0 if len(answer_text.strip()) > 20 else 0.0
confidence = 0.4 * coverage + 0.4 * retrieval_accuracy + 0.2 * nonempty
Factors:
  • 40%: Coverage (retrieved document count relative to the top-k of 4)
  • 40%: Retrieval accuracy (citation validity)
  • 20%: Non-empty answer check
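Wrapping the formula from the source in a function gives a worked example: with all four documents retrieved, fully valid citations, and a substantive answer, the score is 0.4 + 0.4 + 0.2 = 1.0.

```python
def confidence(retrieved_docs, answer_text, retrieval_accuracy):
    """Weighted confidence score, per the formula in src/qa_api.py."""
    coverage = min(len(retrieved_docs) / 4.0, 1.0)
    nonempty = 1.0 if len(answer_text.strip()) > 20 else 0.0
    return 0.4 * coverage + 0.4 * retrieval_accuracy + 0.2 * nonempty

# Best case: 4 docs, accuracy 1.0, answer longer than 20 characters.
score = confidence(["d1", "d2", "d3", "d4"],
                   "Wrap both terms in SUM() before dividing.", 1.0)
# score == 1.0
```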

Fallback Behavior

If OPENAI_API_KEY is not set or retrieval fails, the service returns a fallback response:
{
  "answer": "I don't have enough context to answer confidently.",
  "confidence": 0.0,
  "citations": [],
  "latency_ms": 12.34,
  "retrieval_accuracy": 0.0,
  "hallucination_flag": false
}

Starting the Service

Start the QA API:
export OPENAI_API_KEY=sk-...
uvicorn src.qa_api:app --host 0.0.0.0 --port 8001
With auto-reload:
uvicorn src.qa_api:app --reload --host 0.0.0.0 --port 8001

Monitoring Metrics

The service tracks in-memory metrics for all requests:
  • requests_total: Total QA requests processed
  • latency_ms_total: Cumulative latency (sum)
  • retrieval_accuracy_total: Cumulative retrieval accuracy (sum)
  • hallucination_count: Number of requests with invalid citations
Metrics are updated after each request (src/qa_api.py:134-139).
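The relationship between these running totals and the averages served by GET /monitoring can be sketched as follows; the class and method names are illustrative, not the service's actual code:

```python
class Metrics:
    """In-memory counters mirroring the tracked totals."""

    def __init__(self):
        self.requests_total = 0
        self.latency_ms_total = 0.0
        self.retrieval_accuracy_total = 0.0
        self.hallucination_count = 0

    def record(self, latency_ms, retrieval_accuracy, hallucination_flag):
        """Update the running totals after each QA request."""
        self.requests_total += 1
        self.latency_ms_total += latency_ms
        self.retrieval_accuracy_total += retrieval_accuracy
        self.hallucination_count += int(hallucination_flag)

    def snapshot(self):
        """Derive the /monitoring response from the totals."""
        n = self.requests_total
        return {
            "requests_total": n,
            "avg_latency_ms": self.latency_ms_total / n if n else 0.0,
            "avg_retrieval_accuracy": self.retrieval_accuracy_total / n if n else 0.0,
            "hallucination_rate": self.hallucination_count / n if n else 0.0,
        }

m = Metrics()
m.record(1000.0, 1.0, False)
m.record(2000.0, 0.5, True)
snap = m.snapshot()
# snap == {"requests_total": 2, "avg_latency_ms": 1500.0,
#          "avg_retrieval_accuracy": 0.75, "hallucination_rate": 0.5}
```

Because the counters are in-memory, they reset on every restart of the service.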

Error Handling

  • 503 Service Unavailable: LLM not configured (missing API key)
  • 422 Unprocessable Entity: Invalid request fields
  • 500 Internal Server Error: Unexpected errors during retrieval or generation